Nathan P Sharp created FLINK-28540:
--------------------------------------

             Summary: Unaligned checkpoint waiting in 'start delay' with AsyncDataStream
                 Key: FLINK-28540
                 URL: https://issues.apache.org/jira/browse/FLINK-28540
             Project: Flink
          Issue Type: Bug
    Affects Versions: 1.15.0
         Environment: Flink 1.15.0 using default options from [https://hub.docker.com/_/flink]
            Reporter: Nathan P Sharp
         Attachments: SingleCheckpointLongStartDelay.png

I am attempting to use unaligned checkpointing with AsyncDataStream, but the checkpoints sit in "start delay" until the job finishes.

I have published code that reproduces this at [https://github.com/phxnsharp/AsyncDataStreamCheckpointReproduction]

Reproduction steps:
 * Create a single-node Docker swarm.
 * Run the docker-compose.yml file in the repository: docker stack up -c docker-compose.yml flink
 * Use Flink's web UI to upload the .jar file and run it with default settings.

Expected behavior:

Checkpoints complete about once per second, since they are unaligned.

Actual behavior:

After some number of failed checkpoints (the tasks are not running yet), a single checkpoint sits in "start delay" until the job finishes.
 * Searching the web suggests that the most common cause is asyncInvoke blocking. I added a test in the code to confirm that this is not the case here.
 * I have tried using the RocksDB state backend, which did not help.
 * I have tried adding additional TaskWorkers, which did not help.
 * I have checked the TaskWorker stats and nothing seems awry. There is no unusual memory consumption, for example, and nothing obvious in the stack traces.
 * If I change the code to be sequential instead of async, checkpoints work fine.
 * The log file merely shows the checkpoint being triggered and then being completed 47 seconds later. No additional information is logged.

Mailing list conversation: [https://lists.apache.org/thread/2y3fb93zfsttq03z11xcnynf10xbpgnn]

Thank you!

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
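
For illustration only, the kind of check the report mentions for a blocking asyncInvoke can be sketched in plain Java, without Flink on the classpath. This is not the code from the reproduction repository. The class and method names below (AsyncInvokeBlockingCheck, nonBlockingInvoke, blockingInvoke, slowLookup, millisToReturn) are hypothetical, and a Consumer stands in for the ResultFuture that Flink passes to AsyncFunction.asyncInvoke. The idea is simply to time how long the call takes to return, as opposed to how long the result takes to arrive.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

public class AsyncInvokeBlockingCheck {

    /** Hypothetical stand-in for a well-behaved asyncInvoke body:
     *  it schedules the slow work on another thread and returns at once. */
    static void nonBlockingInvoke(int input, Consumer<Integer> onResult) {
        CompletableFuture
            .supplyAsync(() -> slowLookup(input)) // runs off the calling thread
            .thenAccept(onResult);                // hands the result back later
    }

    /** Counter-example: performs the slow work on the calling thread. */
    static void blockingInvoke(int input, Consumer<Integer> onResult) {
        onResult.accept(slowLookup(input));
    }

    /** Simulated slow external call (assumed 200 ms latency). */
    static int slowLookup(int input) {
        try {
            Thread.sleep(200);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return input * 2;
    }

    /** Measures how many milliseconds the call itself takes to return. */
    static long millisToReturn(Runnable call) {
        long start = System.nanoTime();
        call.run();
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    public static void main(String[] args) throws Exception {
        CompletableFuture<Integer> result = new CompletableFuture<>();
        long asyncMs = millisToReturn(() -> nonBlockingInvoke(21, result::complete));
        long blockingMs = millisToReturn(() -> blockingInvoke(21, r -> { }));

        System.out.println("non-blocking call returned after " + asyncMs + " ms");
        System.out.println("blocking call returned after " + blockingMs + " ms");
        System.out.println("async result: " + result.get(2, TimeUnit.SECONDS));
    }
}
```

A call that returns in a few milliseconds while its result arrives later is behaving asynchronously. A call that takes roughly the full lookup latency to return is blocking the calling thread, which is the condition the report rules out.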