[ 
https://issues.apache.org/jira/browse/FLINK-30108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17656093#comment-17656093
 ] 

Matthias Pohl commented on FLINK-30108:
---------------------------------------

The following line at {{00:59:34,533}} in {{zookeeper-server-3.log}} might give 
us some indication:
{code}
00:59:34,533 [        SyncThread:0] WARN  
org.apache.zookeeper.server.persistence.FileTxnLog           [] - fsync-ing the 
write ahead log in SyncThread:0 took 1261ms which will adversely effect 
operation latency. File size is 67108880 bytes. See the ZooKeeper 
troubleshooting guide
{code}

It looks like writing to disk was kind of slow and took longer (1261ms) than 
the negotiated timeout of 1000ms. My initial thought is to increase the 
minSessionTimeout to reduce the risk of running into this. But it's weird that 
we have this timeout of the application/test happening rather than observing a 
test failure

> ZooKeeperLeaderElectionConnectionHandlingTest.testLoseLeadershipOnLostConnectionIfTolerateSuspendedConnectionsIsEnabled
>  times out
> ---------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-30108
>                 URL: https://issues.apache.org/jira/browse/FLINK-30108
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination, Tests
>    Affects Versions: 1.17.0
>            Reporter: Leonard Xu
>            Priority: Major
>              Labels: test-stability
>         Attachments: zookeeper-server.FLINK-30108.log
>
>
> {noformat}
> Nov 18 01:02:58 [INFO] Tests run: 16, Failures: 0, Errors: 0, Skipped: 0, 
> Time elapsed: 109.22 s - in 
> org.apache.flink.runtime.operators.hash.InPlaceMutableHashTableTest
> Nov 18 01:18:09 
> ==============================================================================
> Nov 18 01:18:09 Process produced no output for 900 seconds.
> Nov 18 01:18:09 
> ==============================================================================
> Nov 18 01:18:09 
> ==============================================================================
> Nov 18 01:18:09 The following Java processes are running (JPS)
> Nov 18 01:18:09 
> ==============================================================================
> Picked up JAVA_TOOL_OPTIONS: -XX:+HeapDumpOnOutOfMemoryError
> Nov 18 01:18:09 924 Launcher
> Nov 18 01:18:09 23421 surefirebooter1178962604207099497.jar
> Nov 18 01:18:09 11885 Jps
> Nov 18 01:18:09 
> ==============================================================================
> Nov 18 01:18:09 Printing stack trace of Java process 924
> Nov 18 01:18:09 
> ==============================================================================
> Picked up JAVA_TOOL_OPTIONS: -XX:+HeapDumpOnOutOfMemoryError
> Nov 18 01:18:09 2022-11-18 01:18:09
> Nov 18 01:18:09 Full thread dump OpenJDK 64-Bit Server VM (25.292-b10 mixed 
> mode):
> ...
> ...
> ...
> Nov 18 01:18:09 
> ==============================================================================
> Nov 18 01:18:09 Printing stack trace of Java process 11885
> Nov 18 01:18:09 
> ==============================================================================
> 11885: No such process
> Nov 18 01:18:09 Killing process with pid=923 and all descendants
> /__w/2/s/tools/ci/watchdog.sh: line 113:   923 Terminated              $cmd
> Nov 18 01:18:10 Process exited with EXIT CODE: 143.
> Nov 18 01:18:10 Trying to KILL watchdog (919).
> Nov 18 01:18:10 Searching for .dump, .dumpstream and related files in 
> '/__w/2/s'
> Nov 18 01:18:16 Moving 
> '/__w/2/s/flink-runtime/target/surefire-reports/2022-11-18T00-55-55_041-jvmRun3.dumpstream'
>  to target directory ('/__w/_temp/debug_files')
> Nov 18 01:18:16 Moving 
> '/__w/2/s/flink-runtime/target/surefire-reports/2022-11-18T00-55-55_041-jvmRun3.dump'
>  to target directory ('/__w/_temp/debug_files')
> The STDIO streams did not close within 10 seconds of the exit event from 
> process '/bin/bash'. This may indicate a child process inherited the STDIO 
> streams and has not yet exited.
> ##[error]Bash exited with code '143'.
> Finishing: Test - core
> {noformat}
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=43277&view=logs&j=0e7be18f-84f2-53f0-a32d-4a5e4a174679&t=7c1d86e3-35bd-5fd5-3b7c-30c126a78702



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to