[
https://issues.apache.org/jira/browse/FLINK-26105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17493257#comment-17493257
]
Matthias Pohl commented on FLINK-26105:
---------------------------------------
The failing assert searches the log files for the substring "Recovered JobGraph" (see
[common_ha.sh:68|https://github.com/apache/flink/blob/badce69548a30e77b1964fb570110c241e7703d5/flink-end-to-end-tests/test-scripts/common_ha.sh#L68])
and counts the number of files that contain it. Two log files are
expected to contain this substring because of the two JM failovers.
The JM logs reveal a TM connection issue that causes the TM to fail over.
In the meantime, the JM logs are flooded with
{{RecipientUnreachableException: Could not send message}} error messages, which
causes the log rollover strategy to kick in. The older logs (including the
JM initialization) move into a {{*.log.1}} file, which is not considered by the
assert in {{common_ha.sh}}, resulting in the test failure.
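
The effect of the rollover on a file-counting check can be illustrated with a minimal sketch. This is not the actual code in {{common_ha.sh}}: the function names, the argument order and the way rotated files are folded together are assumptions made for illustration only. The first helper only looks at {{*.log}} files, the second also searches rotated files and strips the numeric suffix so that each process is counted once.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration; not the actual common_ha.sh code.

# Counts the files below $1 whose name matches *$2*.log and that contain
# the fixed string $3. Rotated files such as *.log.1 do not match this glob.
count_current_logs_containing() {
    local log_dir="$1" name_pattern="$2" text="$3"
    grep -rlF --include "*${name_pattern}*.log" -e "${text}" "${log_dir}" \
        | wc -l | tr -d ' '
}

# Same count, but rotated files (*.log.1, *.log.2, ...) are searched as well.
# The numeric rotation suffix is stripped and duplicates are removed, so a
# process whose output is spread over several files is counted only once.
count_all_logs_containing() {
    local log_dir="$1" name_pattern="$2" text="$3"
    grep -rlF --include "*${name_pattern}*.log*" -e "${text}" "${log_dir}" \
        | sed 's/\.[0-9][0-9]*$//' | sort -u | wc -l | tr -d ' '
}
```

If the "Recovered JobGraph" line of one JM has already been rotated into a {{*.log.1}} file, the first helper reports one file fewer than the second, which matches the mismatch described above.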
> Running HA (hashmap, async) end-to-end test failed on azure
> -----------------------------------------------------------
>
> Key: FLINK-26105
> URL: https://issues.apache.org/jira/browse/FLINK-26105
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.15.0
> Reporter: Yun Gao
> Assignee: Matthias Pohl
> Priority: Critical
> Labels: test-stability
>
> {code:java}
> Feb 14 01:31:29 Killed TM @ 255483
> Feb 14 01:31:29 Starting new TM.
> Feb 14 01:31:42 Killed TM @ 258722
> Feb 14 01:31:42 Starting new TM.
> Feb 14 01:32:00 Checking for non-empty .out files...
> Feb 14 01:32:00 No non-empty .out files.
> Feb 14 01:32:00 FAILURE: A JM did not take over.
> Feb 14 01:32:00 One or more tests FAILED.
> Feb 14 01:32:00 Stopping job timeout watchdog (with pid=250820)
> Feb 14 01:32:00 Killing JM watchdog @ 252644
> Feb 14 01:32:00 Killing TM watchdog @ 253262
> Feb 14 01:32:00 [FAIL] Test script contains errors.
> Feb 14 01:32:00 Checking of logs skipped.
> Feb 14 01:32:00
> Feb 14 01:32:00 [FAIL] 'Running HA (hashmap, async) end-to-end test' failed
> after 2 minutes and 51 seconds! Test exited with exit code 1
> Feb 14 01:32:00
> 01:32:00 ##[group]Environment Information
> Feb 14 01:32:01 Searching for .dump, .dumpstream and related files in
> '/home/vsts/work/1/s'
> dmesg: read kernel buffer failed: Operation not permitted
> Feb 14 01:32:06 Stopping taskexecutor daemon (pid: 259377) on host
> fv-az313-602.
> Feb 14 01:32:07 Stopping standalonesession daemon (pid: 256528) on host
> fv-az313-602.
> Feb 14 01:32:08 Stopping zookeeper...
> Feb 14 01:32:08 Stopping zookeeper daemon (pid: 251023) on host fv-az313-602.
> Feb 14 01:32:09 Skipping taskexecutor daemon (pid: 251636), because it is not
> running anymore on fv-az313-602.
> Feb 14 01:32:09 Skipping taskexecutor daemon (pid: 255483), because it is not
> running anymore on fv-az313-602.
> Feb 14 01:32:09 Skipping taskexecutor daemon (pid: 258722), because it is not
> running anymore on fv-az313-602.
> The STDIO streams did not close within 10 seconds of the exit event from
> process '/usr/bin/bash'. This may indicate a child process inherited the
> STDIO streams and has not yet exited.
> ##[error]Bash exited with code '1'.
> {code}
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=31347&view=logs&j=e9d3d34f-3d15-59f4-0e3e-35067d100dfe&t=f8a6d3eb-38cf-5cca-9a99-d0badeb5fe62&l=8020
--
This message was sent by Atlassian Jira
(v8.20.1#820001)