[ 
https://issues.apache.org/jira/browse/FLINK-33012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17760837#comment-17760837
 ] 

Matthias Pohl commented on FLINK-33012:
---------------------------------------

Thanks for reporting this issue. I looked into the logs. It's an infrastructure 
issue where the fsync didn't come back fast enough:
{code}
[...]
01:03:36,229 [        SyncThread:0] INFO  org.apache.zookeeper.server.persistence.FileTxnLog           [] - Creating new log file: log.1
01:04:16,985 [        SyncThread:0] WARN  org.apache.zookeeper.server.persistence.FileTxnLog           [] - fsync-ing the write ahead log in SyncThread:0 took 40754ms which will adversely effect operation latency.File size is 67108880 bytes. See the ZooKeeper troubleshooting guid
[...]
{code}
This fsync delay of about 40s (40754ms) made the system unstable and led to 
the ConnectionLoss reported in this ticket. I'll attach the failure-related 
logs to the ticket. We've seen test instabilities due to the same cause in 
the past (e.g. FLINK-30108, FLINK-28523). I'll check whether we can make the 
test more stable in this regard.
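
For triaging similar failures, here is a minimal sketch of how a build log 
could be scanned for slow ZooKeeper fsyncs like the one quoted above. The 
function name slow_fsyncs and the 1000ms default threshold are assumptions 
made for illustration and are not part of Flink or ZooKeeper tooling. The 
pattern is taken only from the warning text in the log excerpt.

```python
import re

# Pattern derived from the FileTxnLog warning quoted in this comment;
# it captures the reported fsync duration in milliseconds.
FSYNC_WARNING = re.compile(r"fsync-ing the write ahead log in \S+ took (\d+)ms")


def slow_fsyncs(log_text, threshold_ms=1000):
    """Return fsync durations (ms) in log_text that are >= threshold_ms.

    threshold_ms is a hypothetical default chosen for this sketch.
    """
    # Collapse all whitespace so hard-wrapped log lines still match.
    normalized = " ".join(log_text.split())
    durations = [int(ms) for ms in FSYNC_WARNING.findall(normalized)]
    return [ms for ms in durations if ms >= threshold_ms]


# Excerpt of the warning from this ticket, wrapped as it arrived by email.
sample = """
01:04:16,985 [        SyncThread:0] WARN
org.apache.zookeeper.server.persistence.FileTxnLog           [] - fsync-ing the
write ahead log in SyncThread:0 took 40754ms which will adversely effect
operation latency.File size is 67108880 bytes.
"""

print(slow_fsyncs(sample))
```

A non-empty result for a failed run suggests that the failure is more likely 
an infrastructure issue than a bug in the test itself.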

> LeaderElectionTest.testHasLeadership fails on AZP
> -------------------------------------------------
>
>                 Key: FLINK-33012
>                 URL: https://issues.apache.org/jira/browse/FLINK-33012
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.19.0
>            Reporter: Sergey Nuyanzin
>            Assignee: Matthias Pohl
>            Priority: Critical
>              Labels: test-stability
>
> This build 
> [https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=52871&view=logs&j=675bf62c-8558-587e-2555-dcad13acefb5&t=5878eed3-cc1e-5b12-1ed0-9e7139ce0992&l=7151]
> fails as
> {noformat}
> Aug 31 01:04:17 Caused by: org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
> Aug 31 01:04:17       at org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
> Aug 31 01:04:17       at org.apache.flink.shaded.curator5.org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:885)
> Aug 31 01:04:17       at org.apache.flink.shaded.curator5.org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:1025)
> Aug 31 01:04:17       at org.apache.flink.shaded.curator5.org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:986)
> Aug 31 01:04:17       at org.apache.flink.shaded.curator5.org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:97)
> Aug 31 01:04:17       at org.apache.flink.shaded.curator5.org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:376)
> Aug 31 01:04:17       at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
> Aug 31 01:04:17       at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
> Aug 31 01:04:17       at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
> Aug 31 01:04:17       at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
> Aug 31 01:04:17       at java.base/java.lang.Thread.run(Thread.java:833)
> Aug 31 01:04:17       Suppressed: org.apache.flink.shaded.curator5.org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
> Aug 31 01:04:17               at org.apache.flink.shaded.curator5.org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:1031)
> Aug 31 01:04:17               ... 8 more
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
