[
https://issues.apache.org/jira/browse/FLINK-33012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17760837#comment-17760837
]
Matthias Pohl commented on FLINK-33012:
---------------------------------------
Thanks for reporting this issue. I looked into the logs. It's an infrastructure
issue where the fsync didn't come back fast enough:
{code}
[...]
01:03:36,229 [ SyncThread:0] INFO org.apache.zookeeper.server.persistence.FileTxnLog [] - Creating new log file: log.1
01:04:16,985 [ SyncThread:0] WARN org.apache.zookeeper.server.persistence.FileTxnLog [] - fsync-ing the write ahead log in SyncThread:0 took 40754ms which will adversely effect operation latency.File size is 67108880 bytes. See the ZooKeeper troubleshooting guid
[...]
{code}
This delay of 40s caused the trouble and made the system unstable. I'm gonna
attach the failure-related logs to this ticket. We've seen test
instabilities due to the same cause in the past (e.g. FLINK-30108,
FLINK-28523). I'm gonna check whether we can make the test more stable in this
regard.
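For anyone triaging similar failures, here is a minimal sketch of how the fsync duration could be pulled out of such a log and compared against a timeout. The class name FsyncDelayCheck, its method names, and the 30s threshold are made up for illustration; the threshold is an assumption, not the session timeout the test actually uses.

```java
import java.util.OptionalLong;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Flink or ZooKeeper.
public class FsyncDelayCheck {
    // Matches the duration in the FileTxnLog fsync warning quoted above.
    private static final Pattern FSYNC_WARNING =
            Pattern.compile("fsync-ing the write ahead log in \\S+ took (\\d+)ms");

    /** Returns the fsync duration in milliseconds if the text contains the warning. */
    public static OptionalLong fsyncMillis(String logText) {
        // The warning may be wrapped across lines, so collapse whitespace first.
        String normalized = logText.replaceAll("\\s+", " ");
        Matcher m = FSYNC_WARNING.matcher(normalized);
        return m.find()
                ? OptionalLong.of(Long.parseLong(m.group(1)))
                : OptionalLong.empty();
    }

    /** True if an fsync warning is present and its duration exceeds the given timeout. */
    public static boolean exceeds(String logText, long timeoutMillis) {
        OptionalLong millis = fsyncMillis(logText);
        return millis.isPresent() && millis.getAsLong() > timeoutMillis;
    }

    public static void main(String[] args) {
        String warning = "fsync-ing the write ahead log in SyncThread:0 took 40754ms";
        long assumedTimeoutMillis = 30_000L; // assumption, for illustration only
        System.out.println("fsync millis: " + fsyncMillis(warning).getAsLong());
        System.out.println("exceeds assumed timeout: " + exceeds(warning, assumedTimeoutMillis));
    }
}
```

A check like this only flags slow-disk runs after the fact; it doesn't make the test itself any more robust.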
> LeaderElectionTest.testHasLeadership fails on AZP
> -------------------------------------------------
>
> Key: FLINK-33012
> URL: https://issues.apache.org/jira/browse/FLINK-33012
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.19.0
> Reporter: Sergey Nuyanzin
> Assignee: Matthias Pohl
> Priority: Critical
> Labels: test-stability
>
> This build
> [https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=52871&view=logs&j=675bf62c-8558-587e-2555-dcad13acefb5&t=5878eed3-cc1e-5b12-1ed0-9e7139ce0992&l=7151]
> fails as
> {noformat}
> Aug 31 01:04:17 Caused by: org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
> Aug 31 01:04:17 at org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
> Aug 31 01:04:17 at org.apache.flink.shaded.curator5.org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:885)
> Aug 31 01:04:17 at org.apache.flink.shaded.curator5.org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:1025)
> Aug 31 01:04:17 at org.apache.flink.shaded.curator5.org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:986)
> Aug 31 01:04:17 at org.apache.flink.shaded.curator5.org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:97)
> Aug 31 01:04:17 at org.apache.flink.shaded.curator5.org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:376)
> Aug 31 01:04:17 at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
> Aug 31 01:04:17 at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
> Aug 31 01:04:17 at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
> Aug 31 01:04:17 at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
> Aug 31 01:04:17 at java.base/java.lang.Thread.run(Thread.java:833)
> Aug 31 01:04:17 Suppressed: org.apache.flink.shaded.curator5.org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
> Aug 31 01:04:17 at org.apache.flink.shaded.curator5.org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:1031)
> Aug 31 01:04:17 ... 8 more
> {noformat}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)