[ 
https://issues.apache.org/jira/browse/FLINK-30108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17656155#comment-17656155
 ] 

Matthias Pohl edited comment on FLINK-30108 at 1/9/23 3:35 PM:
---------------------------------------------------------------

The test itself gets stuck in 
[contender.awaitGrantLeadership()|https://github.com/apache/flink/blob/c60eb0c3b4bf7dc045dd7a1da2080c7befebb8dc/flink-runtime/src/test/java/org/apache/flink/runtime/leaderelection/ZooKeeperLeaderElectionConnectionHandlingTest.java#L147]
 according to the [thread 
dump|https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=43277&view=logs&j=0e7be18f-84f2-53f0-a32d-4a5e4a174679&t=7c1d86e3-35bd-5fd5-3b7c-30c126a78702&l=9944].
 I still don't understand why we don't pick up leadership anymore.

I extracted the relevant logs from each of the files (zookeeper-server-3.log, 
zookeeper-client-3.log, mvn-3.log) and merged them into one file, sorted by 
timestamp, to get a better understanding of what's happening when. I used 
the following command (for reproducibility):
{code}
$ cat <(cat zookeeper-server.FLINK-30108.log| xargs -I'{}' echo 'server # {}') 
<(cat zookeeper-client.FLINK-30108.log | xargs -I'{}' echo 'client # {}') <(cat 
mvn.FLINK-30108.log| xargs -I'{}' echo 'test   # {}') | sort -t'#' -k2,2
{code}
...the resulting file {{all.FLINK-30108.log}} is also in the attached archive. 
(Some of the lines might be in the wrong order, e.g. lines without a 
timestamp or lines sharing the same timestamp, but it's good enough to get an 
understanding of what's going on.)
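
The ordering caveat could probably be reduced with a variant of the command above. The following is only a minimal sketch, not what was used to produce the attached file. It assumes that every log record starts with an {{HH:MM:SS,mmm}} timestamp (as in the test log; not verified for the ZooKeeper logs) and that lines without a timestamp belong to the preceding record. {{merge_logs}} is a hypothetical helper name. It carries the last seen timestamp forward as a sort key and relies on a stable sort, so continuation lines stay attached to their record and ties keep their input order.

```shell
# merge_logs LABEL FILE [LABEL FILE ...]
# Prefixes every line with its label and merges all files by timestamp.
merge_logs() {
  while [ "$#" -ge 2 ]; do
    # Emit "<sort key><TAB><label> # <line>"; the key is the most recent
    # timestamp seen at the start of a line (assumed format: HH:MM:SS,mmm).
    awk -v label="$1" '
      match($0, /^[0-9][0-9]:[0-9][0-9]:[0-9][0-9],[0-9][0-9][0-9]/) {
        key = substr($0, RSTART, RLENGTH)
      }
      { printf "%s\t%-6s # %s\n", key, label, $0 }
    ' "$2"
    shift 2
  done |
    # -s keeps the input order for equal keys; the key column is dropped again.
    LC_ALL=C sort -s -t "$(printf '\t')" -k1,1 |
    cut -f2-
}

# Possible usage with the file names from the command above:
# merge_logs server zookeeper-server.FLINK-30108.log \
#            client zookeeper-client.FLINK-30108.log \
#            test   mvn.FLINK-30108.log > all.FLINK-30108.log
```

Compared to the {{xargs}} pipeline, this should also avoid problems with quotes in log lines, because {{awk}} passes each line through unchanged.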

I still haven't figured out what the last line at 00:59:45,201 means:
{code}
test   # 00:59:45,201 [ Curator-Framework-0] INFO  
org.apache.flink.shaded.curator5.org.apache.curator.framework.imps.ProtectedMode
 [] - Session has changed during protected mode with ephemeral. old: 
72362230717874177 new: 72362230717874178
{code}
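
To follow the two sessions through the server log, it might help to convert the IDs first: the line above prints them in decimal, whereas the ZooKeeper server log, as far as I know, prints session IDs in hexadecimal with a {{0x}} prefix. A small sketch ({{to_zk_session_hex}} is a hypothetical helper name, and the server log format is an assumption):

```shell
# Convert a decimal session ID into the 0x-prefixed hexadecimal form.
to_zk_session_hex() {
  printf '0x%x\n' "$1"
}

# Possible usage: look up the old session in the server log.
# grep -i "$(to_zk_session_hex 72362230717874177)" zookeeper-server.FLINK-30108.log
```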

[~zhuzh] can you get something out of it?


was (Author: mapohl):
The test itself gets stuck in 
[contender.awaitGrantLeadership()|https://github.com/apache/flink/blob/c60eb0c3b4bf7dc045dd7a1da2080c7befebb8dc/flink-runtime/src/test/java/org/apache/flink/runtime/leaderelection/ZooKeeperLeaderElectionConnectionHandlingTest.java#L147]
 according to the [thread 
dump|https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=43277&view=logs&j=0e7be18f-84f2-53f0-a32d-4a5e4a174679&t=7c1d86e3-35bd-5fd5-3b7c-30c126a78702&l=9944].
 I still don't understand why we don't pick up leadership anymore.

I extracted the relevant logs from each of the files (zookeeper-server-3.log, 
zookeeper-client-3.log, mvn-3.log) and merged them into one file, sorted by 
timestamp, to get a better understanding of what's happening when. I used 
the following command (for reproducibility):
{code}
$ cat <(cat zookeeper-server.FLINK-30108.log| xargs -I'{}' echo 'server # {}') 
<(cat zookeeper-client.FLINK-30108.log | xargs -I'{}' echo 'client # {}') <(cat 
mvn.FLINK-30108.log| xargs -I'{}' echo 'test   # {}') | sort -t'#' -k2,2
{code}
...the resulting file {{all.FLINK-30108.log}} is also in the attached archive. 
(Some of the lines might be in the wrong order, but it's good enough to get an 
understanding of what's going on.)

> ZooKeeperLeaderElectionConnectionHandlingTest.testLoseLeadershipOnLostConnectionIfTolerateSuspendedConnectionsIsEnabled
>  times out
> ---------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-30108
>                 URL: https://issues.apache.org/jira/browse/FLINK-30108
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination, Tests
>    Affects Versions: 1.17.0
>            Reporter: Leonard Xu
>            Priority: Major
>              Labels: test-stability
>         Attachments: FLINK-30108.tar.gz, zookeeper-server.FLINK-30108.log
>
>
> {noformat}
> Nov 18 01:02:58 [INFO] Tests run: 16, Failures: 0, Errors: 0, Skipped: 0, 
> Time elapsed: 109.22 s - in 
> org.apache.flink.runtime.operators.hash.InPlaceMutableHashTableTest
> Nov 18 01:18:09 
> ==============================================================================
> Nov 18 01:18:09 Process produced no output for 900 seconds.
> Nov 18 01:18:09 
> ==============================================================================
> Nov 18 01:18:09 
> ==============================================================================
> Nov 18 01:18:09 The following Java processes are running (JPS)
> Nov 18 01:18:09 
> ==============================================================================
> Picked up JAVA_TOOL_OPTIONS: -XX:+HeapDumpOnOutOfMemoryError
> Nov 18 01:18:09 924 Launcher
> Nov 18 01:18:09 23421 surefirebooter1178962604207099497.jar
> Nov 18 01:18:09 11885 Jps
> Nov 18 01:18:09 
> ==============================================================================
> Nov 18 01:18:09 Printing stack trace of Java process 924
> Nov 18 01:18:09 
> ==============================================================================
> Picked up JAVA_TOOL_OPTIONS: -XX:+HeapDumpOnOutOfMemoryError
> Nov 18 01:18:09 2022-11-18 01:18:09
> Nov 18 01:18:09 Full thread dump OpenJDK 64-Bit Server VM (25.292-b10 mixed 
> mode):
> ...
> ...
> ...
> Nov 18 01:18:09 
> ==============================================================================
> Nov 18 01:18:09 Printing stack trace of Java process 11885
> Nov 18 01:18:09 
> ==============================================================================
> 11885: No such process
> Nov 18 01:18:09 Killing process with pid=923 and all descendants
> /__w/2/s/tools/ci/watchdog.sh: line 113:   923 Terminated              $cmd
> Nov 18 01:18:10 Process exited with EXIT CODE: 143.
> Nov 18 01:18:10 Trying to KILL watchdog (919).
> Nov 18 01:18:10 Searching for .dump, .dumpstream and related files in 
> '/__w/2/s'
> Nov 18 01:18:16 Moving 
> '/__w/2/s/flink-runtime/target/surefire-reports/2022-11-18T00-55-55_041-jvmRun3.dumpstream'
>  to target directory ('/__w/_temp/debug_files')
> Nov 18 01:18:16 Moving 
> '/__w/2/s/flink-runtime/target/surefire-reports/2022-11-18T00-55-55_041-jvmRun3.dump'
>  to target directory ('/__w/_temp/debug_files')
> The STDIO streams did not close within 10 seconds of the exit event from 
> process '/bin/bash'. This may indicate a child process inherited the STDIO 
> streams and has not yet exited.
> ##[error]Bash exited with code '143'.
> Finishing: Test - core
> {noformat}
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=43277&view=logs&j=0e7be18f-84f2-53f0-a32d-4a5e4a174679&t=7c1d86e3-35bd-5fd5-3b7c-30c126a78702



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
