[jira] [Created] (FLINK-28540) Unaligned checkpoint waiting in 'start delay' with AsyncDataStream

Nathan P Sharp (Jira) Wed, 13 Jul 2022 08:49:46 -0700

Nathan P Sharp created FLINK-28540:
--------------------------------------

             Summary: Unaligned checkpoint waiting in 'start delay' with AsyncDataStream
                 Key: FLINK-28540
                 URL: https://issues.apache.org/jira/browse/FLINK-28540
             Project: Flink
          Issue Type: Bug
    Affects Versions: 1.15.0
         Environment: Flink 1.15.0 using default options from [https://hub.docker.com/_/flink]
            Reporter: Nathan P Sharp
         Attachments: SingleCheckpointLongStartDelay.png

I am attempting to use unaligned checkpointing with AsyncDataStream, but the 
checkpoints sit in "start delay" until the job finishes.

I have published code that reproduces this at 
[https://github.com/phxnsharp/AsyncDataStreamCheckpointReproduction]

Reproduction steps:
 * Create a single node Docker swarm.
 * Deploy the docker-compose.yml file from the repository as a stack:
   docker stack up -c docker-compose.yml flink
 * Use Flink's web UI to upload the .jar file and run it with default settings.

Expected behavior: Checkpoints complete about once per second, since they are 
unaligned.

Actual behavior: After several failed checkpoints (the tasks are not yet 
running at that point), a single checkpoint sits in "start delay" until the 
job finishes.

 * Web searches suggest that the most common cause is asyncInvoke blocking. 
I added a test to the code to confirm that it does not block.
 * I tried the RocksDB state backend, which did not help.
 * I tried adding additional TaskWorkers, which did not help.
 * I checked the TaskWorker stats and nothing seems awry: no unusual memory 
consumption, for example, and nothing obvious in the stack traces.
 * If I change the code to be sequential instead of async, checkpoints work 
fine.
 * The log file merely shows the checkpoint being triggered and then 
completed 47 seconds later. No additional information is logged.
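
To illustrate the kind of non-blocking check mentioned in the first bullet, here is a minimal sketch in plain Java. It does not use Flink and is not the test from the reproduction repository. The class AsyncInvokeTimingCheck, the helper slowLookup, the Invoke interface, and the 200 ms latency are all hypothetical, chosen only for this example. The idea is to time the call on the calling thread: a non-blocking implementation hands the work to a thread pool and returns almost immediately, while a blocking one holds the caller for the full latency.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical example class; not part of Flink or of the reproduction repository.
final class AsyncInvokeTimingCheck {

    // Assumed latency of the simulated external call.
    public static final long SIMULATED_LATENCY_MS = 200;

    // Shape of an asyncInvoke-like method: an input, a pool, and a result callback.
    public interface Invoke {
        void call(String key, ExecutorService pool, Consumer<String> onResult);
    }

    // Stand-in for a slow external lookup.
    public static String slowLookup(String key) {
        try {
            Thread.sleep(SIMULATED_LATENCY_MS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "value-for-" + key;
    }

    // Non-blocking variant: schedules the lookup on the pool and returns at once.
    public static void nonBlockingInvoke(String key, ExecutorService pool,
                                         Consumer<String> onResult) {
        CompletableFuture.supplyAsync(() -> slowLookup(key), pool).thenAccept(onResult);
    }

    // Blocking variant: runs the lookup on the calling thread.
    public static void blockingInvoke(String key, ExecutorService pool,
                                      Consumer<String> onResult) {
        onResult.accept(slowLookup(key));
    }

    // Measures how long the calling thread is held by a single invocation.
    public static long elapsedMillis(Invoke invoke, ExecutorService pool) {
        long start = System.nanoTime();
        invoke.call("k", pool, result -> { });
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            System.out.println("non-blocking call held the caller for ms: "
                    + elapsedMillis(AsyncInvokeTimingCheck::nonBlockingInvoke, pool));
            System.out.println("blocking call held the caller for ms: "
                    + elapsedMillis(AsyncInvokeTimingCheck::blockingInvoke, pool));
        } finally {
            pool.shutdown();
            pool.awaitTermination(5, TimeUnit.SECONDS);
        }
    }
}
```

A check like this only shows that the call returns quickly on the calling thread. It says nothing about why a checkpoint would remain in "start delay".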

Mailing list conversation: 
[https://lists.apache.org/thread/2y3fb93zfsttq03z11xcnynf10xbpgnn]

Thank you!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
