[
https://issues.apache.org/jira/browse/FLINK-16931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Lu Niu updated FLINK-16931:
---------------------------
Description:
When the _metadata file is large, the JobManager may never recover from a
checkpoint. It falls into a loop of fetch checkpoint -> JM timeout -> restart.
Here is the related log:
{code:java}
2020-04-01 17:08:25,689 INFO
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore -
Recovering checkpoints from ZooKeeper.
2020-04-01 17:08:25,698 INFO
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Found 3
checkpoints in ZooKeeper.
2020-04-01 17:08:25,698 INFO
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying
to fetch 3 checkpoints from storage.
2020-04-01 17:08:25,698 INFO
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying
to retrieve checkpoint 50.
2020-04-01 17:08:48,589 INFO
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying
to retrieve checkpoint 51.
2020-04-01 17:09:12,775 INFO org.apache.flink.yarn.YarnResourceManager - The
heartbeat of JobManager with id 02500708baf0bb976891c391afd3d7d5 timed out.
{code}
Digging into the code, it looks like ExecutionGraph::restart runs in the
JobMaster main thread and eventually calls
ZooKeeperCompletedCheckpointStore::retrieveCompletedCheckpoint, which downloads
the _metadata file from DFS. The main thread is therefore blocked for the whole
download. One possible solution is to make the download asynchronous. More
aspects may need to be considered, since the original change deliberately made
this code path single-threaded: [https://github.com/apache/flink/pull/7568]
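To illustrate the async idea, here is a minimal sketch (not the actual Flink code; the class name, retrieveAsync, ioExecutor, mainThreadExecutor, blockingDownload and registerCheckpoint are all placeholders). The blocking DFS read runs on an I/O executor, and only the cheap registration step is handed back to the JobMaster main-thread executor, so the store is still mutated from a single thread:
{code:java}
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.Executor;
import java.util.function.Consumer;
import java.util.function.Supplier;

/**
 * Illustrative sketch only, not the actual Flink implementation. The blocking
 * download of the checkpoint _metadata runs on an I/O executor, and the result
 * is handed back to the JobMaster main-thread executor so that the checkpoint
 * store is still only mutated from a single thread.
 */
final class AsyncCheckpointRetrievalSketch {

    static <T> CompletableFuture<Void> retrieveAsync(
            Supplier<T> blockingDownload,     // e.g. () -> stateHandle.retrieveState()
            Executor ioExecutor,              // thread pool for blocking I/O
            Executor mainThreadExecutor,      // JobMaster main-thread executor
            Consumer<T> registerCheckpoint) { // e.g. completedCheckpoints::add

        return CompletableFuture
                .supplyAsync(
                        () -> {
                            try {
                                // The (possibly large) _metadata file is read
                                // here, off the JobMaster main thread.
                                return blockingDownload.get();
                            } catch (RuntimeException e) {
                                throw new CompletionException(e);
                            }
                        },
                        ioExecutor)
                .thenAcceptAsync(registerCheckpoint, mainThreadExecutor);
    }
}
{code}
The key point of such a design would be that the main thread stays free to answer RPCs (including the ResourceManager heartbeat) while the download is in flight, and only re-enters the picture to register the retrieved checkpoint.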
> Large _metadata file leads to JobManager not responding on restart
> -------------------------------------------------------------------
>
> Key: FLINK-16931
> URL: https://issues.apache.org/jira/browse/FLINK-16931
> Project: Flink
> Issue Type: Bug
> Reporter: Lu Niu
> Priority: Major
>
> When the _metadata file is large, the JobManager may never recover from a
> checkpoint. It falls into a loop of fetch checkpoint -> JM timeout -> restart.
> Here is the related log:
> {code:java}
> 2020-04-01 17:08:25,689 INFO
> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore -
> Recovering checkpoints from ZooKeeper.
> 2020-04-01 17:08:25,698 INFO
> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Found
> 3 checkpoints in ZooKeeper.
> 2020-04-01 17:08:25,698 INFO
> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore -
> Trying to fetch 3 checkpoints from storage.
> 2020-04-01 17:08:25,698 INFO
> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore -
> Trying to retrieve checkpoint 50.
> 2020-04-01 17:08:48,589 INFO
> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore -
> Trying to retrieve checkpoint 51.
> 2020-04-01 17:09:12,775 INFO org.apache.flink.yarn.YarnResourceManager - The
> heartbeat of JobManager with id 02500708baf0bb976891c391afd3d7d5 timed out.
> {code}
> Digging into the code, it looks like ExecutionGraph::restart runs in the
> JobMaster main thread and eventually calls
> ZooKeeperCompletedCheckpointStore::retrieveCompletedCheckpoint, which
> downloads the _metadata file from DFS. The main thread is therefore blocked
> for the whole download. One possible solution is to make the download
> asynchronous. More aspects may need to be considered, since the original
> change deliberately made this code path single-threaded:
> [https://github.com/apache/flink/pull/7568]
--
This message was sent by Atlassian Jira
(v8.3.4#803005)