alpinegizmo commented on a change in pull request #18765:
URL: https://github.com/apache/flink/pull/18765#discussion_r827964351
##########
File path: docs/content/docs/ops/state/checkpoints_vs_savepoints.md
##########
@@ -0,0 +1,97 @@
+---
+title: "Checkpoints VS Savepoints"
+weight: 10
+type: docs
+aliases:
+  - /ops/state/checkpoints_vs_savepoints.html
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# Checkpoints VS Savepoints
+
+## Overview
+
+Conceptually, Flink's [Savepoints]({{< ref "docs/ops/state/savepoints" >}}) are different from [Checkpoints]({{< ref "docs/ops/state/checkpoints" >}})
+in a similar way that backups are different from recovery logs in traditional database systems.

Review comment:
```suggestion
Conceptually, Flink's [Savepoints]({{< ref "docs/ops/state/savepoints" >}}) are different from [Checkpoints]({{< ref "docs/ops/state/checkpoints" >}})
in a way that's analogous to how backups are different from recovery logs in traditional database systems.
```

##########
File path: docs/content/docs/ops/state/checkpoints_vs_savepoints.md
##########
+
+The primary purpose of Checkpoints is to provide a recovery mechanism in case of unexpected job failures.
+A [Checkpoint's lifecycle]({{< ref "docs/dev/datastream/fault-tolerance/checkpointing" >}}) is managed by Flink,
+i.e. a Checkpoint is created, owned, and released by Flink - without user interaction.
+As a method of recovery and being periodically triggered, two main design goals for the Checkpoint implementation are

Review comment:
```suggestion
Because Checkpoints are being triggered often, and are relied upon for failure recovery, the two main design goals for the Checkpoint implementation are
```

##########
File path: docs/content/docs/ops/state/checkpoints_vs_savepoints.md
##########
+As a method of recovery and being periodically triggered, two main design goals for the Checkpoint implementation are
+i) being as lightweight to create and ii) being as fast to restore from as possible.
+Optimizations towards those goals can exploit certain properties, e.g. that the job code doesn't change between the execution attempts.
+Checkpoints are usually dropped after the job was terminated by the user (except if explicitly configured as retained Checkpoints).
+
+{{< hint info >}}
+- Checkpoints are automatically deleted if the application is terminated by the user
+(except if checkpoints are explicitly configured to be retained).
+- Checkpoints are stored in state backend-specific (native) data format (may be incremental depending on the specific backend).
+{{< /hint >}}
+
+Although [savepoints]({{< ref "docs/ops/state/savepoints" >}}) are created internally with the same mechanisms as
+checkpoints, they are conceptually different and can be a bit more expensive to produce and restore from and focus
+more on portability and flexibility with respect to changes to the job.
+Their use case is planned, manual backup and resume. For example, this could be an update of your Flink version, changing your job graph,
+changing parallelism, forking a second job like for a red/blue deployment, and so on.
+
+{{< hint info >}}
+- Savepoints are created, owned and deleted solely by the user.
+That means, Flink does not delete savepoints neither after job termination nor after
+restore.
+- Savepoints are stored in a state backend independent (canonical) format (Note: Since Flink 1.15, savepoints can be also stored in
+the backend-specific [native]({{< ref "docs/ops/state/savepoints" >}}#savepoint-format) format which is faster to create
+and restore but comes with some limitations.
+{{< /hint >}}
+
+### Capabilities and limitations
+The following table gives an overview of capabilities and limitations for the various types of savepoints and
+checkpoints.
+- ✓ - Flink fully support this type of the snapshot
+- x - Flink doesn't support this type of the snapshot
+- ! - in fact, Flink support this type but officially, Flink doesn't support this type of the snapshot(there is certain level of risk to use it)
+
+| Operation                       | Canonical Savepoint | Native Savepoint | Aligned Checkpoint | Unaligned Checkpoint |
+|:--------------------------------|:--------------------|:-----------------|:-------------------|:---------------------|
+| State backend change            | ✓                   | x                | x                  | x                    |
+| State Processor API(writing)    | ✓                   | x                | x                  | x                    |
+| State Processor API(reading)    | ✓                   | !                | !                  | x                    |
+| Self-contained and relocatable  | ✓                   | ✓                | x                  | x                    |
+| Schema evolution                | ✓                   | !                | !                  | !                    |
+| Arbitrary job upgrade           | ✓                   | ✓                | ✓                  | x                    |
+| Non-arbitrary job upgrade       | ✓                   | ✓                | ✓                  | x                    |
+| Flink minor version upgrade     | ✓                   | ✓                | ✓                  | x                    |
+| Flink bug/patch version upgrade | ✓                   | ✓                | ✓                  | ✓                    |
+| Rescaling                       | ✓                   | ✓                | ✓                  | ✓                    |
+
+- [State backend change]({{< ref "docs/ops/state/state_backends" >}}) - configuring different State Backend via `state.backend` parameter than it was during the taking snapshot.

Review comment:
I don't think it matters how the state backend was configured.
```suggestion
- [State backend change]({{< ref "docs/ops/state/state_backends" >}}) - configuring a different State Backend than was used when taking the snapshot.
```

##########
File path: docs/content/docs/ops/state/checkpoints_vs_savepoints.md
##########
+- [State Processor API (writing)]({{< ref "docs/libs/state_processor_api" >}}#writing-new-savepoints) - the ability to create new snapshot via State Processor API.
+- [State Processor API (reading)]({{< ref "docs/libs/state_processor_api" >}}#reading-state) - the ability to read states from the existing snapshot via State Processor API.
+- Self-contained and relocatable - the one snapshot folder contains everything it needs for recovery
+and it doesn't depend on other snapshots which means it can be easily moved to another place if needed.
+- [Schema evolution]({{< ref "docs/dev/datastream/fault-tolerance/serialization/schema_evolution" >}}) - changing the *state* data type.
+- Arbitrary job upgrade - restoring the snapshot with the different [partitioning type]({{< ref "docs/dev/datastream/operators/overview" >}}#physical-partitioning)(rescale, rebalance, map, etc.)
+or with the different record type for the existing operator.

Review comment:
```suggestion
- Arbitrary job upgrade - the snapshot can be restored even if the [partitioning types]({{< ref "docs/dev/datastream/operators/overview" >}}#physical-partitioning)(rescale, rebalance, map, etc.)
or in-flight record types for the existing operators have changed.
```

##########
File path: docs/content/docs/ops/state/checkpoints_vs_savepoints.md
##########
+- Non-arbitrary job upgrade - restoring the snapshot with the new operator but without changing the graph shape and record types.
+- Flink minor version upgrade - restoring the snapshot which was taken for the older minor version of Flink (1.x → 1.y).
+- Flink bug/patch version upgrade - restoring the snapshot which was taken for the older patch version of Flink (1.14.x → 1.14.y).

Review comment:
```suggestion
- Flink bug/patch version upgrade - restoring a snapshot taken with an older patch version of Flink (1.14.x → 1.14.y).
```

##########
File path: docs/content/docs/ops/state/checkpoints_vs_savepoints.md
##########
+Although [savepoints]({{< ref "docs/ops/state/savepoints" >}}) are created internally with the same mechanisms as
+checkpoints, they are conceptually different and can be a bit more expensive to produce and restore from and focus
+more on portability and flexibility with respect to changes to the job.

Review comment:
```suggestion
Although [savepoints]({{< ref "docs/ops/state/savepoints" >}}) are created internally with the same mechanisms as
checkpoints, they are conceptually different and can be a bit more expensive to produce and restore from. Their design focuses
more on portability and operational flexibility, especially with respect to changes to the job.
```

##########
File path: docs/content/docs/ops/state/checkpoints_vs_savepoints.md
##########
+Their use case is planned, manual backup and resume. For example, this could be an update of your Flink version, changing your job graph,
+changing parallelism, forking a second job like for a red/blue deployment, and so on.

Review comment:
Given that we support rescaling and forking from retained checkpoints, I think it's potentially confusing to explicitly mention those as motivating use cases for savepoints.
```suggestion
The use case for savepoints is for planned, manual operations. For example, this could be an update of your Flink version, changing your job graph, and so on.
```

##########
File path: docs/content/docs/ops/state/checkpoints_vs_savepoints.md
##########
+Optimizations towards those goals can exploit certain properties, e.g. that the job code doesn't change between the execution attempts.

Review comment:
```suggestion
Optimizations towards those goals can exploit certain properties, e.g., that the job code doesn't change between the execution attempts.
```

##########
File path: docs/content/docs/ops/state/checkpoints_vs_savepoints.md
##########
+Checkpoints are usually dropped after the job was terminated by the user (except if explicitly configured as retained Checkpoints).

Review comment:
```suggestion
Checkpoints are usually deleted after the job has been terminated by the user (except if explicitly configured as retained Checkpoints).
```

##########
File path: docs/content/docs/ops/state/checkpoints_vs_savepoints.md
##########
+- ! - in fact, Flink support this type but officially, Flink doesn't support this type of the snapshot(there is certain level of risk to use it)

Review comment:
```suggestion
- ! - While these operations currently work, Flink doesn't officially guarantee support for them, so there is a certain level of risk associated with them
```

##########
File path: docs/content/docs/ops/state/checkpoints_vs_savepoints.md
##########
+## Overview
+
+Conceptually, Flink's [Savepoints]({{< ref "docs/ops/state/savepoints" >}}) are different from [Checkpoints]({{< ref "docs/ops/state/checkpoints" >}})
+in a similar way that backups are different from recovery logs in traditional database systems.
+ +The primary purpose of Checkpoints is to provide a recovery mechanism in case of unexpected job failures. +A [Checkpoint's lifecycle]({{< ref "docs/dev/datastream/fault-tolerance/checkpointing" >}}) is managed by Flink, +i.e. a Checkpoint is created, owned, and released by Flink - without user interaction. +As a method of recovery and being periodically triggered, two main design goals for the Checkpoint implementation are +i) being as lightweight to create and ii) being as fast to restore from as possible. +Optimizations towards those goals can exploit certain properties, e.g. that the job code doesn't change between the execution attempts. +Checkpoints are usually dropped after the job was terminated by the user (except if explicitly configured as retained Checkpoints). + +{{< hint info >}} +- Checkpoints are automatically deleted if the application is terminated by the user +(except if checkpoints are explicitly configured to be retained). +- Checkpoints are stored in state backend-specific (native) data format (may be incremental depending on the specific backend). +{{< /hint >}} + +Although [savepoints]({{< ref "docs/ops/state/savepoints" >}}) are created internally with the same mechanisms as +checkpoints, they are conceptually different and can be a bit more expensive to produce and restore from and focus +more on portability and flexibility with respect to changes to the job. +Their use case is planned, manual backup and resume. For example, this could be an update of your Flink version, changing your job graph, +changing parallelism, forking a second job like for a red/blue deployment, and so on. + +{{< hint info >}} +- Savepoints are created, owned and deleted solely by the user. +That means, Flink does not delete savepoints neither after job termination nor after +restore. 
+- Savepoints are stored in a state backend independent (canonical) format (Note: Since Flink 1.15, savepoints can be also stored in +the backend-specific [native]({{< ref "docs/ops/state/savepoints" >}}#savepoint-format) format which is faster to create +and restore but comes with some limitations. +{{< /hint >}} + +### Capabilities and limitations +The following table gives an overview of capabilities and limitations for the various types of savepoints and +checkpoints. +- ✓ - Flink fully support this type of the snapshot +- x - Flink doesn't support this type of the snapshot +- ! - in fact, Flink support this type but officially, Flink doesn't support this type of the snapshot(there is certain level of risk to use it) + +| Operation | Canonical Savepoint | Native Savepoint | Aligned Checkpoint | Unaligned Checkpoint | +|:--------------------------------|:--------------------|:-----------------|:-------------------|:---------------------| +| State backend change | ✓ | x | x | x | +| State Processor API(writing) | ✓ | x | x | x | +| State Processor API(reading) | ✓ | ! | ! | x | +| Self-contained and relocatable | ✓ | ✓ | x | x | +| Schema evolution | ✓ | ! | ! | ! | +| Arbitrary job upgrade | ✓ | ✓ | ✓ | x | +| Non-arbitrary job upgrade | ✓ | ✓ | ✓ | x | +| Flink minor version upgrade | ✓ | ✓ | ✓ | x | +| Flink bug/patch version upgrade | ✓ | ✓ | ✓ | ✓ | +| Rescaling | ✓ | ✓ | ✓ | ✓ | + +- [State backend change]({{< ref "docs/ops/state/state_backends" >}}) - configuring different State Backend via `state.backend` parameter than it was during the taking snapshot. +- [State Processor API (writing)]({{< ref "docs/libs/state_processor_api" >}}#writing-new-savepoints) - the ability to create new snapshot via State Processor API. +- [State Processor API (reading)]({{< ref "docs/libs/state_processor_api" >}}#reading-state) - the ability to read states from the existing snapshot via State Processor API. 
Review comment:
```suggestion
- [State Processor API (reading)]({{< ref "docs/libs/state_processor_api" >}}#reading-state) - the ability to read states from an existing snapshot of this type via the State Processor API.
```

##########
File path: docs/content/docs/ops/state/checkpoints_vs_savepoints.md
##########
@@ -0,0 +1,97 @@
+- [State Processor API (writing)]({{< ref "docs/libs/state_processor_api" >}}#writing-new-savepoints) - the ability to create new snapshot via State Processor API.

Review comment:
```suggestion
- [State Processor API (writing)]({{< ref "docs/libs/state_processor_api" >}}#writing-new-savepoints) - the ability to create a new snapshot of this type via the State Processor API.
```

##########
File path: docs/content/docs/ops/state/checkpoints_vs_savepoints.md
##########
@@ -0,0 +1,97 @@
+- Self-contained and relocatable - the one snapshot folder contains everything it needs for recovery
+and it doesn't depend on other snapshots which means it can be easily moved to another place if needed.
+- [Schema evolution]({{< ref "docs/dev/datastream/fault-tolerance/serialization/schema_evolution" >}}) - changing the *state* data type.
+- Arbitrary job upgrade - restoring the snapshot with the different [partitioning type]({{< ref "docs/dev/datastream/operators/overview" >}}#physical-partitioning)(rescale, rebalance, map, etc.)
+or with the different record type for the existing operator.
+- Non-arbitrary job upgrade - restoring the snapshot with the new operator but without changing the graph shape and record types.
+- Flink minor version upgrade - restoring the snapshot which was taken for the older minor version of Flink (1.x → 1.y).

Review comment:
```suggestion
- Flink minor version upgrade - restoring a snapshot taken with an older minor version of Flink (1.x → 1.y).
```

##########
File path: docs/content/docs/ops/state/checkpoints_vs_savepoints.md
##########
@@ -0,0 +1,97 @@
+- [Schema evolution]({{< ref "docs/dev/datastream/fault-tolerance/serialization/schema_evolution" >}}) - changing the *state* data type.

Review comment:
```suggestion
- [Schema evolution]({{< ref "docs/dev/datastream/fault-tolerance/serialization/schema_evolution" >}}) - the *state* data type can be changed if it uses a serializer that supports schema evolution (e.g., POJOs and Avro types)
```

##########
File path: docs/content/docs/ops/state/checkpoints_vs_savepoints.md
##########
@@ -0,0 +1,97 @@
+- Non-arbitrary job upgrade - restoring the snapshot with the new operator but without changing the graph shape and record types.

Review comment:
```suggestion
- Non-arbitrary job upgrade - restoring the snapshot is possible with updated operators if the job graph topology and in-flight record types remain unchanged.
```

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
