[ https://issues.apache.org/jira/browse/FLINK-7449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16171693#comment-16171693 ]
ASF GitHub Bot commented on FLINK-7449:
---------------------------------------

Github user alpinegizmo commented on a diff in the pull request:

    https://github.com/apache/flink/pull/4543#discussion_r139684257

--- Diff: docs/ops/state/checkpoints.md ---
@@ -99,3 +99,296 @@ above).

 ```sh
 $ bin/flink run -s :checkpointMetaDataPath [:runArgs]
 ```
+
+## Incremental Checkpoints
+
+### Synopsis
+
+Incremental checkpoints can significantly reduce checkpointing time in comparison to full checkpoints, at the cost of a
+(potentially) longer recovery time. The core idea is that incremental checkpoints only record changes in state since the
+previously-completed checkpoint instead of producing a full, self-contained backup of the state backend. In this way,
+incremental checkpoints can build upon previous checkpoints.
+
+`RocksDBStateBackend` is currently the only backend that supports incremental checkpoints.
+
+Flink leverages RocksDB's internal backup mechanism in a way that is self-consolidating over time. As a result, the
+incremental checkpoint history in Flink does not grow indefinitely, and old checkpoints are eventually subsumed and
+pruned automatically.
+
+While we strongly encourage the use of incremental checkpoints for Flink jobs with large state, please note that this is
+a new feature and currently not enabled by default.
+
+To enable this feature, users can instantiate a `RocksDBStateBackend` with the corresponding boolean flag in the
+constructor set to `true`, e.g.:
+
+```java
+    RocksDBStateBackend backend =
+        new RocksDBStateBackend(filebackend, true);
+```
+
+### Use-case for Incremental Checkpoints
+
+Checkpoints are the centerpiece of Flink's fault tolerance mechanism, and each checkpoint represents a consistent
+snapshot of the distributed state of a Flink job from which the system can recover in case of a software or machine
+failure (see [here]({{ site.baseurl }}/internals/stream_checkpointing.html)).
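The constructor call quoted in the diff above can be placed into a complete job setup. The following is a hedged sketch, not part of the quoted patch: the checkpoint URI and interval are placeholder assumptions, and the exact constructor and `setStateBackend` signatures may vary between Flink versions.

```java
import java.io.IOException;

import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class IncrementalCheckpointJob {

    public static void main(String[] args) throws IOException {
        StreamExecutionEnvironment env =
            StreamExecutionEnvironment.getExecutionEnvironment();

        // Take a checkpoint every 60 seconds; with incremental checkpoints
        // enabled, each one only uploads the RocksDB files created since
        // the previously-completed checkpoint.
        env.enableCheckpointing(60_000);

        // Placeholder URI -- point this at your cluster's durable storage.
        String checkpointDataUri = "hdfs:///flink/checkpoints";

        // The second constructor argument enables incremental checkpoints.
        env.setStateBackend(new RocksDBStateBackend(checkpointDataUri, true));

        // ... define sources, transformations, and sinks here ...
    }
}
```

Note that incremental checkpointing is chosen per job via the state backend; jobs that keep the default flag (`false`) continue to take full, self-contained checkpoints.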
+
+Flink creates checkpoints periodically to track the progress of a job so that, in case of failure, only those
+(hopefully few) *events that have been processed after the last completed checkpoint* must be reprocessed from the data
+source. The number of events that must be reprocessed has implications for recovery time, and so for fastest recovery,
+we want to *take checkpoints as often as possible*.
+
+However, checkpoints are not without performance cost and can introduce *considerable overhead* to the system. This
+overhead can lead to lower throughput and higher latency during the time that checkpoints are created. One reason is
+that, traditionally, each checkpoint in Flink always represented the *complete state* of the job at the time of the
+checkpoint, and all of the state had to be written to stable storage (typically some distributed file system) for every
+single checkpoint. Writing multiple terabytes (or more) of state data for each checkpoint can obviously create
+significant load for the I/O and network subsystems, on top of the normal load from pipeline's data processing work.
--- End diff --

on top of the normal load from the pipeline's data processing work.

(add "the")

> Improve and enhance documentation for incremental checkpoints
> -------------------------------------------------------------
>
>                 Key: FLINK-7449
>                 URL: https://issues.apache.org/jira/browse/FLINK-7449
>             Project: Flink
>          Issue Type: Improvement
>          Components: Documentation
>    Affects Versions: 1.4.0
>            Reporter: Stefan Richter
>            Assignee: Stefan Richter
>            Priority: Minor
>
> We should provide more details about incremental checkpoints in the
> documentation.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)