Lucas Borges created FLINK-38325:
------------------------------------

             Summary: Checkpoints are hanging and timing out frequently
                 Key: FLINK-38325
                 URL: https://issues.apache.org/jira/browse/FLINK-38325
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Checkpointing
    Affects Versions: 2.0.0, 2.1.0
         Environment: Flink version 2.1 (also observed on 2.0) with the ForSt state backend. Running on Kubernetes using the Apache Flink Kubernetes Operator.
            Reporter: Lucas Borges
         Attachments: Screenshot 2025-09-03 at 14.53.56.png, Screenshot 2025-09-03 at 14.54.21.png, Screenshot 2025-09-03 at 14.54.36.png

This issue is being observed on a Flink 2.1 job running with the ForSt state backend. Checkpoints on this job hang or time out more frequently than on other jobs that run Flink 1.x. Based on a thread dump from one TaskManager, we suspect there may be a deadlock somewhere (the dump could not be attached to the Jira issue due to size limits).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)