[ https://issues.apache.org/jira/browse/FLINK-5142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Stephan Ewen resolved FLINK-5142.
---------------------------------
Resolution: Fixed
Fixed via e2c53cf85c1af73c040d96dbd24b9e2cf3e8cdf6
> Resource leak in CheckpointCoordinator
> --------------------------------------
>
> Key: FLINK-5142
> URL: https://issues.apache.org/jira/browse/FLINK-5142
> Project: Flink
> Issue Type: Bug
> Components: State Backends, Checkpointing
> Affects Versions: 1.1.1, 1.1.2, 1.1.3
> Reporter: Frank Lauterwald
> Assignee: Stephan Ewen
> Fix For: 1.1.4
>
>
> We run Flink 1.1.3 with a fairly aggressive checkpoint interval and a
> minimum pause between checkpoints to make sure that some work gets done
> between checkpoints.
> Over time, the JobManager uses more and more CPU time until it saturates the
> available cores. It does not show heavy I/O load and the task managers seem
> to work without problems.
> We see lots of log messages of the form "Trying to trigger another checkpoint
> while one was queued already" - sometimes multiple in the same millisecond.
> It seems like checkpoints are triggered way too often.
> I suspect there is a resource leak in the CheckpointCoordinator which leads
> to this behavior:
>
> // in triggerCheckpoint(long timestamp, long nextCheckpointId), line 414ff
> // introduced as part of FLINK-3492
> if (lastTriggeredCheckpoint + minPauseBetweenCheckpoints > timestamp) {
>     if (currentPeriodicTrigger != null) {
>         currentPeriodicTrigger.cancel();
>         currentPeriodicTrigger = null;
>     }
>     ScheduledTrigger trigger = new ScheduledTrigger();
>     timer.scheduleAtFixedRate(trigger, minPauseBetweenCheckpoints, baseInterval);
>     return false;
> }
> The newly created trigger is not assigned to currentPeriodicTrigger, so it
> cannot be cancelled whenever another rescheduling is required.
> If rescheduling is common (it happens several times per minute for us), the
> running triggers accumulate until they overwhelm the JobManager.
> Versions up to Flink 1.0.x are unaffected because FLINK-3492 is a Flink 1.1
> feature.
> The issue seems to be already fixed in master by commit 8854d75c due to
> (unrelated) work on FLINK-4322.
> Let me know if there's anything else I can do to help.
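
The leak described in the report can be reproduced in isolation. The following is a minimal sketch, not Flink code: the class TriggerRescheduling, its keepReference flag and the liveTriggers() counter are hypothetical names introduced for illustration, and it assumes the coordinator schedules its triggers on a java.util.Timer as the quoted snippet suggests. It contrasts the reported behavior (new trigger never stored) with one possible correction (store the new trigger so that the next rescheduling can cancel it), which is not necessarily what the referenced commits do.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Timer;
import java.util.TimerTask;

public class TriggerRescheduling {

    /** Hypothetical stand-in for the ScheduledTrigger in the report. */
    public static final class ScheduledTrigger extends TimerTask {
        private volatile boolean cancelled;

        @Override
        public void run() {
            // A real trigger would attempt to trigger a checkpoint here.
        }

        @Override
        public boolean cancel() {
            cancelled = true;
            return super.cancel();
        }

        public boolean isCancelled() {
            return cancelled;
        }
    }

    private final Timer timer = new Timer("checkpoint-timer", true);
    // Bookkeeping for this demo only: every trigger ever scheduled.
    private final List<ScheduledTrigger> created = new ArrayList<>();
    private final boolean keepReference;
    private ScheduledTrigger currentPeriodicTrigger;

    /** keepReference = false mimics the reported leak; true applies the assumed fix. */
    public TriggerRescheduling(boolean keepReference) {
        this.keepReference = keepReference;
    }

    public void reschedule(long minPauseMillis, long baseIntervalMillis) {
        if (currentPeriodicTrigger != null) {
            currentPeriodicTrigger.cancel();
            currentPeriodicTrigger = null;
        }
        ScheduledTrigger trigger = new ScheduledTrigger();
        created.add(trigger);
        if (keepReference) {
            // Without this assignment the trigger can never be cancelled later.
            currentPeriodicTrigger = trigger;
        }
        timer.scheduleAtFixedRate(trigger, minPauseMillis, baseIntervalMillis);
    }

    /** Number of scheduled triggers that have not been cancelled. */
    public long liveTriggers() {
        return created.stream().filter(t -> !t.isCancelled()).count();
    }

    public void shutdown() {
        timer.cancel();
    }

    public static void main(String[] args) {
        TriggerRescheduling leaky = new TriggerRescheduling(false);
        TriggerRescheduling fixed = new TriggerRescheduling(true);
        for (int i = 0; i < 5; i++) {
            leaky.reschedule(60_000, 60_000);
            fixed.reschedule(60_000, 60_000);
        }
        System.out.println("leaky live triggers: " + leaky.liveTriggers());
        System.out.println("fixed live triggers: " + fixed.liveTriggers());
        leaky.shutdown();
        fixed.shutdown();
    }
}
```

In the leaky variant every rescheduling adds one more periodic task to the timer, which matches the reported pattern of ever more frequent trigger attempts; with the reference kept, at most one periodic trigger stays active.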