[
https://issues.apache.org/jira/browse/FLINK-18148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17130458#comment-17130458
]
Robert Metzger edited comment on FLINK-18148 at 6/10/20, 10:29 AM:
-------------------------------------------------------------------
Stop with savepoint has to complete within client.timeout = 60s (default).
Time breakdown:
checkpoint creation takes ~ 52 seconds
{code}
2020-06-10 05:02:49,464 INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering
checkpoint 3 @ 1591765369463 for job 327846815d4c49ae30f9bdc7352218bc.
2020-06-10 05:03:41,100 INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed
checkpoint 3 for job 327846815d4c49ae30f9bdc7352218bc (401806 bytes in 51636
ms).
{code}
Stopping the job takes ~ 5 minutes
{code}
2020-06-10 05:07:20,002 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Job General
purpose test job (327846815d4c49ae30f9bdc7352218bc) switched from state RUNNING
to FINISHED.
{code}
The slow operators to stop are:
{code}
2020-06-10 05:03:41,175 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Source:
EventSource -> Timestamps/Watermarks (1/4) (396d333e1197d54dfc96352bc4aa5db4)
switched from RUNNING to FINISHED.
2020-06-10 05:06:09,917 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph [] -
ArtificalKeyedStateMapper_Kryo_and_Custom_Stateful (3/4)
(7dfc199016041880c09f9a43a2396c7e) switched from RUNNING to FINISHED.
2020-06-10 05:06:44,858 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph [] -
ArtificalKeyedStateMapper_Kryo_and_Custom_Stateful (2/4)
(1d56a721f54ab6ae58c9782a2477d484) switched from RUNNING to FINISHED.
2020-06-10 05:07:09,917 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph [] -
ArtificalKeyedStateMapper_Kryo_and_Custom_Stateful (4/4)
(9fa1b1913284b6a75b8021f9513ce932) switched from RUNNING to FINISHED.
2020-06-10 05:07:18,320 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph [] -
ArtificalKeyedStateMapper_Kryo_and_Custom_Stateful (1/4)
(26f36930d6622198eb67711eaacae00c) switched from RUNNING to FINISHED.
{code}
>From the TM logs, it that the "Un-registration" is slow?!
{code}
2020-06-10 05:03:41,159 INFO
org.apache.flink.runtime.taskexecutor.TaskExecutor [] -
Un-registering task and sending final execution state FINISHED to JobManager
for task Source: EventSource -> Timestamps/Watermarks (1/4)
396d333e1197d54dfc96352bc4aa5db4.
2020-06-10 05:06:09,908 INFO
org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend [] - Closed
RocksDB State Backend. Cleaning up RocksDB working directory
/tmp/flink-io-61f9aa77-daa2-4f83-9552-1f598583d156/job_327846815d4c49ae30f9bdc7352218bc_op_StreamMap_5271c210329e73bd743f3227edfb3b71__3_4__uuid_6e5a12d5-4906-4622-923e-70fb8eb9a23f.
2020-06-10 05:06:09,915 INFO org.apache.flink.runtime.taskmanager.Task
[] - ArtificalKeyedStateMapper_Kryo_and_Custom_Stateful (3/4)
(7dfc199016041880c09f9a43a2396c7e) switched from RUNNING to FINISHED.
{code}
was (Author: rmetzger):
Stop with savepoint has to complete within client.timeout = 60s (default).
Time breakdown:
checkpoint creation takes ~ 52 seconds
{code}
2020-06-10 05:02:49,464 INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering
checkpoint 3 @ 1591765369463 for job 327846815d4c49ae30f9bdc7352218bc.
2020-06-10 05:03:41,100 INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed
checkpoint 3 for job 327846815d4c49ae30f9bdc7352218bc (401806 bytes in 51636
ms).
{code}
Stopping the job takes ~ 5 minutes
{code}
2020-06-10 05:07:20,002 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Job General
purpose test job (327846815d4c49ae30f9bdc7352218bc) switched from state RUNNING
to FINISHED.
{code}
The slow operators to stop are:
{code}
2020-06-10 05:03:41,175 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Source:
EventSource -> Timestamps/Watermarks (1/4) (396d333e1197d54dfc96352bc4aa5db4)
switched from RUNNING to FINISHED.
2020-06-10 05:06:09,917 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph [] -
ArtificalKeyedStateMapper_Kryo_and_Custom_Stateful (3/4)
(7dfc199016041880c09f9a43a2396c7e) switched from RUNNING to FINISHED.
2020-06-10 05:06:44,858 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph [] -
ArtificalKeyedStateMapper_Kryo_and_Custom_Stateful (2/4)
(1d56a721f54ab6ae58c9782a2477d484) switched from RUNNING to FINISHED.
2020-06-10 05:07:09,917 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph [] -
ArtificalKeyedStateMapper_Kryo_and_Custom_Stateful (4/4)
(9fa1b1913284b6a75b8021f9513ce932) switched from RUNNING to FINISHED.
2020-06-10 05:07:18,320 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph [] -
ArtificalKeyedStateMapper_Kryo_and_Custom_Stateful (1/4)
(26f36930d6622198eb67711eaacae00c) switched from RUNNING to FINISHED.
{code}
>From the TM logs, it seems that the RocksDB cleanup is slow?
{code}
2020-06-10 05:03:41,159 INFO
org.apache.flink.runtime.taskexecutor.TaskExecutor [] -
Un-registering task and sending final execution state FINISHED to JobManager
for task Source: EventSource -> Timestamps/Watermarks (1/4)
396d333e1197d54dfc96352bc4aa5db4.
2020-06-10 05:06:09,908 INFO
org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend [] - Closed
RocksDB State Backend. Cleaning up RocksDB working directory
/tmp/flink-io-61f9aa77-daa2-4f83-9552-1f598583d156/job_327846815d4c49ae30f9bdc7352218bc_op_StreamMap_5271c210329e73bd743f3227edfb3b71__3_4__uuid_6e5a12d5-4906-4622-923e-70fb8eb9a23f.
2020-06-10 05:06:09,915 INFO org.apache.flink.runtime.taskmanager.Task
[] - ArtificalKeyedStateMapper_Kryo_and_Custom_Stateful (3/4)
(7dfc199016041880c09f9a43a2396c7e) switched from RUNNING to FINISHED.
{code}
> "Resuming Savepoint" e2e fails with TimeoutException in CliFrontend.stop()
> ---------------------------------------------------------------------------
>
> Key: FLINK-18148
> URL: https://issues.apache.org/jira/browse/FLINK-18148
> Project: Flink
> Issue Type: Bug
> Components: Command Line Client
> Affects Versions: 1.11.0, 1.12.0
> Reporter: Robert Metzger
> Assignee: Robert Metzger
> Priority: Critical
> Labels: test-stability
> Fix For: 1.11.0
>
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=2759&view=logs&j=c88eea3b-64a0-564d-0031-9fdcd7b8abee&t=1e2bbe5b-4657-50be-1f07-d84bfce5b1f5
> {code}
> ------------------------------------------------------------
> The program finished with the following exception:
> org.apache.flink.util.FlinkException: Could not stop with a savepoint job
> "081bda854bc250e01055ed1ba9d43178".
> at
> org.apache.flink.client.cli.CliFrontend.lambda$stop$5(CliFrontend.java:495)
> at
> org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:864)
> at org.apache.flink.client.cli.CliFrontend.stop(CliFrontend.java:487)
> at
> org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:931)
> at
> org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:992)
> at
> org.apache.flink.runtime.security.contexts.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:30)
> at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:992)
> Caused by: java.util.concurrent.TimeoutException
> at
> java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1784)
> at
> java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
> at
> org.apache.flink.client.cli.CliFrontend.lambda$stop$5(CliFrontend.java:493)
> ... 6 more
> Waiting for job (081bda854bc250e01055ed1ba9d43178) to reach terminal state
> FINISHED ...
> Job (081bda854bc250e01055ed1ba9d43178) reached terminal state FINISHED
> Savepoint location was empty. This may mean that the stop-with-savepoint
> failed.
> [FAIL] Test script contains errors.
> {code}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)