Repository: flink
Updated Branches:
  refs/heads/master 2841401a7 -> 6fb9ebc4f

[FLINK-5574] [docs] Add checkpoint monitoring docs

Project: http://git-wip-us.apache.org/repos/asf/flink/repo
Commit: http://git-wip-us.apache.org/repos/asf/flink/commit/6fb9ebc4
Tree: http://git-wip-us.apache.org/repos/asf/flink/tree/6fb9ebc4
Diff: http://git-wip-us.apache.org/repos/asf/flink/diff/6fb9ebc4

Branch: refs/heads/master
Commit: 6fb9ebc4f291fadf75b86a77fdb7df66e2b9b46b
Parents: 2841401
Author: Ufuk Celebi <[email protected]>
Authored: Thu Jan 19 17:06:04 2017 +0100
Committer: Ufuk Celebi <[email protected]>
Committed: Thu Jan 19 17:06:15 2017 +0100

----------------------------------------------------------------------
 docs/fig/checkpoint_monitoring-details.png      | Bin 0 -> 94761 bytes
 .../checkpoint_monitoring-details_subtasks.png  | Bin 0 -> 47710 bytes
 .../checkpoint_monitoring-details_summary.png   | Bin 0 -> 50860 bytes
 docs/fig/checkpoint_monitoring-history.png      | Bin 0 -> 83569 bytes
 docs/fig/checkpoint_monitoring-summary.png      | Bin 0 -> 51257 bytes
 docs/monitoring/checkpoint_monitoring.md        | 115 +++++++++++++++++++
 6 files changed, 115 insertions(+)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/flink/blob/6fb9ebc4/docs/fig/checkpoint_monitoring-details.png
----------------------------------------------------------------------
diff --git a/docs/fig/checkpoint_monitoring-details.png b/docs/fig/checkpoint_monitoring-details.png
new file mode 100644
index 0000000..6b1bbbc
Binary files /dev/null and b/docs/fig/checkpoint_monitoring-details.png differ

http://git-wip-us.apache.org/repos/asf/flink/blob/6fb9ebc4/docs/fig/checkpoint_monitoring-details_subtasks.png
----------------------------------------------------------------------
diff --git a/docs/fig/checkpoint_monitoring-details_subtasks.png b/docs/fig/checkpoint_monitoring-details_subtasks.png
new file mode 100644
index 0000000..703873e
Binary files /dev/null and b/docs/fig/checkpoint_monitoring-details_subtasks.png differ

http://git-wip-us.apache.org/repos/asf/flink/blob/6fb9ebc4/docs/fig/checkpoint_monitoring-details_summary.png
----------------------------------------------------------------------
diff --git a/docs/fig/checkpoint_monitoring-details_summary.png b/docs/fig/checkpoint_monitoring-details_summary.png
new file mode 100644
index 0000000..f9503bf
Binary files /dev/null and b/docs/fig/checkpoint_monitoring-details_summary.png differ

http://git-wip-us.apache.org/repos/asf/flink/blob/6fb9ebc4/docs/fig/checkpoint_monitoring-history.png
----------------------------------------------------------------------
diff --git a/docs/fig/checkpoint_monitoring-history.png b/docs/fig/checkpoint_monitoring-history.png
new file mode 100644
index 0000000..ec7d1df
Binary files /dev/null and b/docs/fig/checkpoint_monitoring-history.png differ

http://git-wip-us.apache.org/repos/asf/flink/blob/6fb9ebc4/docs/fig/checkpoint_monitoring-summary.png
----------------------------------------------------------------------
diff --git a/docs/fig/checkpoint_monitoring-summary.png b/docs/fig/checkpoint_monitoring-summary.png
new file mode 100644
index 0000000..254dfe7
Binary files /dev/null and b/docs/fig/checkpoint_monitoring-summary.png differ

http://git-wip-us.apache.org/repos/asf/flink/blob/6fb9ebc4/docs/monitoring/checkpoint_monitoring.md
----------------------------------------------------------------------
diff --git a/docs/monitoring/checkpoint_monitoring.md b/docs/monitoring/checkpoint_monitoring.md
new file mode 100644
index 0000000..faa38f1
--- /dev/null
+++ b/docs/monitoring/checkpoint_monitoring.md
@@ -0,0 +1,115 @@
+---
+title: "Checkpoint Monitoring"
+nav-parent_id: monitoring
+nav-pos: 2
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+* ToC
+{:toc}
+
+## Overview
+
+Flink's web interface provides a tab to monitor the checkpoints of jobs. These stats are also available after the job has terminated. There are four different tabs to display information about your checkpoints: Overview, History, Summary, and Configuration. The following sections will cover all of these in turn.
+
+## Monitoring
+
+### Overview Tab
+
+The overview tab lists the following statistics. Note that these statistics don't survive a JobManager loss and are reset if your JobManager fails over.
+
+- **Checkpoint Counts**
+    - Triggered: The total number of checkpoints that have been triggered since the job started.
+    - In Progress: The current number of checkpoints that are in progress.
+    - Completed: The total number of successfully completed checkpoints since the job started.
+    - Failed: The total number of failed checkpoints since the job started.
+    - Restored: The number of restore operations since the job started. This also tells you how many times the job has restarted since submission. Note that the initial submission with a savepoint also counts as a restore and the count is reset if the JobManager was lost during operation.
+- **Latest Completed Checkpoint**: The latest successfully completed checkpoint. Clicking on `More details` gives you detailed statistics down to the subtask level.
+- **Latest Failed Checkpoint**: The latest failed checkpoint. Clicking on `More details` gives you detailed statistics down to the subtask level.
+- **Latest Savepoint**: The latest triggered savepoint with its external path. Clicking on `More details` gives you detailed statistics down to the subtask level.
+- **Latest Restore**: There are two types of restore operations.
+    - Restore from Checkpoint: We restored from a regular periodic checkpoint.
+    - Restore from Savepoint: We restored from a savepoint.
+
+### History Tab
+
+The checkpoint history keeps statistics about recently triggered checkpoints, including those that are currently in progress.
+
+<center>
+  <img src="{{ site.baseurl }}/fig/checkpoint_monitoring-history.png" width="700px" alt="Checkpoint Monitoring: History">
+</center>
+
+- **ID**: The ID of the triggered checkpoint. The IDs are incremented for each checkpoint, starting at 1.
+- **Status**: The current status of the checkpoint, which is either *In Progress* (<i aria-hidden="true" class="fa fa-circle-o-notch fa-spin fa-fw"/>), *Completed* (<i aria-hidden="true" class="fa fa-check"/>), or *Failed* (<i aria-hidden="true" class="fa fa-remove"/>). If the triggered checkpoint is a savepoint, you will see a <i aria-hidden="true" class="fa fa-floppy-o"/> symbol.
+- **Trigger Time**: The time when the checkpoint was triggered at the JobManager.
+- **Latest Acknowledgement**: The time when the latest acknowledgement from any subtask was received at the JobManager (or n/a if no acknowledgement has been received yet).
+- **End to End Duration**: The duration from the trigger timestamp until the latest acknowledgement (or n/a if no acknowledgement has been received yet). The end to end duration of a complete checkpoint is determined by the last subtask that acknowledges the checkpoint. This time is usually longer than the time a single subtask needs to actually checkpoint its state.
+- **State Size**: The state size over all acknowledged subtasks.
+- **Buffered During Alignment**: The number of bytes buffered during alignment over all acknowledged subtasks. This is only > 0 if a stream alignment takes place during checkpointing. If the checkpointing mode is `AT_LEAST_ONCE`, this will always be zero because at-least-once mode does not require stream alignment.
+
+#### History Size Configuration
+
+You can configure the number of recent checkpoints that are remembered for the history via the following configuration key. The default is `10`.
+
+```sh
+# Number of recent checkpoints that are remembered
+jobmanager.web.checkpoints.history: 15
+```
+
+### Summary Tab
+
+The summary computes simple min/average/max statistics over all completed checkpoints for the End to End Duration, State Size, and Bytes Buffered During Alignment (see [History](#history-tab) for details about what these mean).
+
+<center>
+  <img src="{{ site.baseurl }}/fig/checkpoint_monitoring-summary.png" width="700px" alt="Checkpoint Monitoring: Summary">
+</center>
+
+Note that these statistics don't survive a JobManager loss and are reset if your JobManager fails over.
+
+### Configuration Tab
+
+The configuration tab lists your streaming configuration:
+
+- **Checkpointing Mode**: Either *Exactly Once* or *At least Once*.
+- **Interval**: The configured checkpointing interval. Checkpoints are triggered at this interval.
+- **Timeout**: Timeout after which a checkpoint is cancelled by the JobManager and a new checkpoint is triggered.
+- **Minimum Pause Between Checkpoints**: Minimum required pause between checkpoints. After a checkpoint has completed successfully, we wait at least for this amount of time before triggering the next one, potentially delaying the regular interval.
+- **Maximum Concurrent Checkpoints**: The maximum number of checkpoints that can be in progress concurrently.
+- **Persist Checkpoints Externally**: Enabled or Disabled. If enabled, the cleanup configuration for externalized checkpoints (delete or retain on cancellation) is also listed.
+
+### Checkpoint Details
+
+When you click on a *More details* link for a checkpoint, you get a Minimum/Average/Maximum summary over all its operators and also the detailed numbers for each subtask.
+
+<center>
+  <img src="{{ site.baseurl }}/fig/checkpoint_monitoring-details.png" width="700px" alt="Checkpoint Monitoring: Details">
+</center>
+
+#### Summary per Operator
+
+<center>
+  <img src="{{ site.baseurl }}/fig/checkpoint_monitoring-details_summary.png" width="700px" alt="Checkpoint Monitoring: Details Summary">
+</center>
+
+#### All Subtask Statistics
+
+<center>
+  <img src="{{ site.baseurl }}/fig/checkpoint_monitoring-details_subtasks.png" width="700px" alt="Checkpoint Monitoring: Subtasks">
+</center>
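
Illustrative note (not part of the commit): the History and Summary statistics described in the added documentation can be read as simple aggregations over per-subtask acknowledgements. The sketch below is a rough model of that reading only. The `Checkpoint` record, its field names, and the `summary` helper are hypothetical and are not Flink classes or Flink's actual implementation.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Checkpoint:
    """Hypothetical record of one checkpoint (not a Flink class)."""
    checkpoint_id: int
    trigger_ms: int          # trigger timestamp at the JobManager
    ack_ms: List[int]        # one acknowledgement timestamp per acknowledged subtask
    state_bytes: List[int]   # state size reported by each acknowledged subtask
    completed: bool = False

    def latest_ack(self) -> Optional[int]:
        # "Latest Acknowledgement": None stands in for n/a.
        return max(self.ack_ms) if self.ack_ms else None

    def end_to_end_ms(self) -> Optional[int]:
        # "End to End Duration": trigger time until the latest acknowledgement,
        # so the slowest subtask determines the value.
        latest = self.latest_ack()
        return None if latest is None else latest - self.trigger_ms

    def state_size(self) -> int:
        # "State Size": sum over all acknowledged subtasks.
        return sum(self.state_bytes)


def min_avg_max(values: List[int]) -> Dict[str, float]:
    return {"min": min(values), "avg": sum(values) / len(values), "max": max(values)}


def summary(checkpoints: List[Checkpoint]) -> Dict[str, Dict[str, float]]:
    # The Summary tab only considers completed checkpoints.
    done = [c for c in checkpoints if c.completed and c.ack_ms]
    if not done:
        return {}
    return {
        "end_to_end_ms": min_avg_max([c.end_to_end_ms() for c in done]),
        "state_size": min_avg_max([c.state_size() for c in done]),
    }


# Made-up sample history: two completed checkpoints and one still in progress.
history = [
    Checkpoint(1, trigger_ms=1000, ack_ms=[1200, 1500], state_bytes=[100, 300], completed=True),
    Checkpoint(2, trigger_ms=2000, ack_ms=[2100, 2300], state_bytes=[200, 400], completed=True),
    Checkpoint(3, trigger_ms=3000, ack_ms=[], state_bytes=[]),
]
stats = summary(history)
```

In this model the in-progress checkpoint with no acknowledgements reports `None` for both the latest acknowledgement and the end to end duration, mirroring the n/a shown in the History tab, and it is excluded from the summary.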
