[jira] [Commented] (FLINK-21329) "Local recovery and sticky scheduling end-to-end test" does not finish within 600 seconds

Roman Khachatryan (Jira) Thu, 15 Apr 2021 03:34:27 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-21329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17322061#comment-17322061
 ]


Roman Khachatryan commented on FLINK-21329:
-------------------------------------------

>From what I see in the logs:
 1. Test times out because the needed 10 checkpoints take too much time each 
(seconds in PASS case vs tens of seconds to minutes in FAIL case)
 2. Checkpoints take longer either because of either longer alignment (~20s) or 
 longer time to trigger checkpoint at the source (40-55s). Both indicate high 
back-pressure:
{code:java}
2021-02-08 22:27:13,093 DEBUG 
org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl [] - 
Flat Map -> Sink: Unnamed (1/4)#3 - finished synchronous part of checkpoint 7. 
Alignment duration: 21525 ms, snapshot duration -1 ms, is unaligned checkpoint 
: false
{code}
{code:java}
2021-02-08 22:27:19,560 DEBUG 
org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Trigger 
checkpoint 9@1612823239549 for 4343c6f1810b6f077dd688312b88e60c.
2021-02-08 22:28:06,724 DEBUG 
org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl [] - 
Source: Custom Source (2/4)#4 - finished synchronous part of checkpoint 9. 
Alignment duration: 0 ms, snapshot duration -1 ms, is unaligned checkpoint : 
false
{code}
Nor sync neither async phases take more than 5s.

 

Sources generate data without any delay currently, so back-pressure can happen 
very likely if the machine is slow. The test tries to perform 10 checkpoints in 
10 minutes; if each takes a bit more than 1 minute then it's likely to timeout. 
This can also explain why RocksDB fails and HashTable doesn't (the latter works 
faster).

 

So propose:
 1. Increase test timeout from 10m to 15m
 2. Induce delay between emitting elements in sources (currently none,  change 
to 100ms)
 3. Log time from receiving checkpoint RPC to actually executing it if it 
exceeds some threshold (say 10s, WARN level)

Turning on UC could eliminate long alignment but not the delay of triggering 
checkpoint at the sources. So wouldn't enable it and wait for it to become a 
default.
 

> "Local recovery and sticky scheduling end-to-end test" does not finish within 
> 600 seconds
> -----------------------------------------------------------------------------------------
>
>                 Key: FLINK-21329
>                 URL: https://issues.apache.org/jira/browse/FLINK-21329
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing, Runtime / Coordination
>    Affects Versions: 1.13.0
>            Reporter: Robert Metzger
>            Assignee: Roman Khachatryan
>            Priority: Critical
>              Labels: test-stability
>             Fix For: 1.13.0
>
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=13118&view=logs&j=c88eea3b-64a0-564d-0031-9fdcd7b8abee&t=ff888d9b-cd34-53cc-d90f-3e446d355529&l=38515
> {code}
> Feb 08 22:25:46 
> ==============================================================================
> Feb 08 22:25:46 Running 'Local recovery and sticky scheduling end-to-end test'
> Feb 08 22:25:46 
> ==============================================================================
> Feb 08 22:25:46 TEST_DATA_DIR: 
> /home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/temp-test-directory-46881214821
> Feb 08 22:25:47 Flink dist directory: 
> /home/vsts/work/1/s/flink-dist/target/flink-1.13-SNAPSHOT-bin/flink-1.13-SNAPSHOT
> Feb 08 22:25:47 Running local recovery test with configuration:
> Feb 08 22:25:47         parallelism: 4
> Feb 08 22:25:47         max attempts: 10
> Feb 08 22:25:47         backend: rocks
> Feb 08 22:25:47         incremental checkpoints: false
> Feb 08 22:25:47         kill JVM: false
> Feb 08 22:25:47 Starting zookeeper daemon on host fv-az127-394.
> Feb 08 22:25:47 Starting HA cluster with 1 masters.
> Feb 08 22:25:48 Starting standalonesession daemon on host fv-az127-394.
> Feb 08 22:25:49 Starting taskexecutor daemon on host fv-az127-394.
> Feb 08 22:25:49 Waiting for Dispatcher REST endpoint to come up...
> Feb 08 22:25:50 Waiting for Dispatcher REST endpoint to come up...
> Feb 08 22:25:51 Waiting for Dispatcher REST endpoint to come up...
> Feb 08 22:25:53 Waiting for Dispatcher REST endpoint to come up...
> Feb 08 22:25:54 Dispatcher REST endpoint is up.
> Feb 08 22:25:54 Started TM watchdog with PID 28961.
> Feb 08 22:25:58 Job has been submitted with JobID 
> e790e85a39040539f9386c0df7ca4812
> Feb 08 22:35:47 Test (pid: 27970) did not finish after 600 seconds.
> Feb 08 22:35:47 Printing Flink logs and killing it:
> {code}
> and
> {code}
>       at 
> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalDriver.unhandledError(ZooKeeperLeaderRetrievalDriver.java:184)
>       at 
> org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl$6.apply(CuratorFrameworkImpl.java:713)
>       at 
> org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl$6.apply(CuratorFrameworkImpl.java:709)
>       at 
> org.apache.flink.shaded.curator4.org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:100)
>       at 
> org.apache.flink.shaded.curator4.org.apache.curator.shaded.com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30)
>       at 
> org.apache.flink.shaded.curator4.org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:92)
>       at 
> org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.logError(CuratorFrameworkImpl.java:708)
>       at 
> org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:874)
>       at 
> org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:990)
>       at 
> org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:943)
>       at 
> org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:66)
>       at 
> org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:346)
>       at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>       at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>       at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>       at java.lang.Thread.run(Thread.java:748)
> Caused by: 
> org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.KeeperException$ConnectionLossException:
>  KeeperErrorCode = ConnectionLoss
>       at 
> org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
>       at 
> org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:862)
>       ... 10 more
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (FLINK-21329) "Local recovery and sticky scheduling end-to-end test" does not finish within 600 seconds

Reply via email to