[
https://issues.apache.org/jira/browse/FLINK-9114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jacob Park updated FLINK-9114:
------------------------------
Description:
When you operate a Flink application that writes externalized checkpoints to
S3, it is difficult to determine which checkpoint is the latest one to recover
from. Because S3 provides read-after-write consistency only for PUTs, listing
an S3 path is not guaranteed to be consistent, so a listing may omit the most
recent checkpoint.
The goal of this improvement is to let users provide a custom
CheckpointRecoveryFactory for non-HA deployments. With it, we can fail a
checkpoint when we cannot guarantee that we will know its location in S3, and
we can co-publish checkpoint metadata to a strongly consistent data store.
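The co-publishing idea can be sketched as follows. This is a minimal illustration only, not Flink code: ConsistentPointerStore, InMemoryPointerStore, CheckpointPointer and CoPublishingCheckpointRegistry are hypothetical names, and the in-memory map merely stands in for a real strongly consistent store. The point it shows is that a checkpoint is failed when its pointer cannot be published, and that recovery reads the pointer instead of listing S3.

```java
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Checkpoint ID plus the external (S3) path of its metadata. Hypothetical type. */
final class CheckpointPointer {
    final long checkpointId;
    final String externalPath;

    CheckpointPointer(long checkpointId, String externalPath) {
        this.checkpointId = checkpointId;
        this.externalPath = externalPath;
    }
}

/** Hypothetical abstraction over a strongly consistent data store. */
interface ConsistentPointerStore {
    /** Returns true only if the write was confirmed by the store. */
    boolean tryPut(String jobId, CheckpointPointer pointer);

    Optional<CheckpointPointer> get(String jobId);
}

/** In-memory stand-in used here so the sketch runs without external services. */
final class InMemoryPointerStore implements ConsistentPointerStore {
    private final ConcurrentMap<String, CheckpointPointer> latest = new ConcurrentHashMap<>();

    @Override
    public boolean tryPut(String jobId, CheckpointPointer pointer) {
        // Keep whichever pointer has the higher checkpoint ID.
        latest.merge(jobId, pointer,
                (oldP, newP) -> newP.checkpointId > oldP.checkpointId ? newP : oldP);
        return true;
    }

    @Override
    public Optional<CheckpointPointer> get(String jobId) {
        return Optional.ofNullable(latest.get(jobId));
    }
}

/** Fails a checkpoint when its location cannot be recorded reliably. */
final class CoPublishingCheckpointRegistry {
    private final ConsistentPointerStore store;

    CoPublishingCheckpointRegistry(ConsistentPointerStore store) {
        this.store = store;
    }

    /** Called after the checkpoint was written to S3; throws to fail the checkpoint. */
    void addCheckpoint(String jobId, long checkpointId, String externalPath) {
        if (externalPath == null || externalPath.isEmpty()) {
            throw new IllegalStateException("Checkpoint " + checkpointId + " has no known location");
        }
        if (!store.tryPut(jobId, new CheckpointPointer(checkpointId, externalPath))) {
            throw new IllegalStateException("Could not publish pointer for checkpoint " + checkpointId);
        }
    }

    /** Recovery reads the pointer from the consistent store, not from an S3 listing. */
    Optional<CheckpointPointer> latest(String jobId) {
        return store.get(jobId);
    }
}
```

In a real deployment the store would be backed by whatever strongly consistent system is available; which one is outside the scope of this sketch.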
I propose the following changes:
# Modify AbstractNonHaServices and StandaloneHaServices to accept an Executor
for the custom CheckpointRecoveryFactory.
# Create a CheckpointRecoveryFactoryLoader to provide the custom
CheckpointRecoveryFactory from the configuration.
# Add new configuration options for this feature.
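A loader along the lines of the second and third items could look roughly like the sketch below. It is an assumption-laden illustration rather than the proposed patch: RecoveryFactory is a simplified stand-in for Flink's CheckpointRecoveryFactory, the configuration is modeled as a plain Map, and the key name checkpoint.recovery-factory.class is invented for this example. The Executor mentioned in the first item is left out.

```java
import java.util.Map;

/** Simplified, hypothetical stand-in for Flink's CheckpointRecoveryFactory. */
interface RecoveryFactory {
    String describe();
}

/** Default used when no custom factory is configured. */
final class StandaloneRecoveryFactory implements RecoveryFactory {
    @Override
    public String describe() {
        return "standalone";
    }
}

/** Example of a user-provided factory; needs a no-argument constructor. */
final class CustomRecoveryFactory implements RecoveryFactory {
    @Override
    public String describe() {
        return "custom";
    }
}

final class RecoveryFactoryLoader {
    // Hypothetical configuration key, not an existing Flink option.
    static final String FACTORY_CLASS_KEY = "checkpoint.recovery-factory.class";

    private RecoveryFactoryLoader() {}

    /** Instantiates the configured factory, or falls back to the standalone default. */
    static RecoveryFactory load(Map<String, String> config, ClassLoader classLoader) {
        String className = config.get(FACTORY_CLASS_KEY);
        if (className == null || className.trim().isEmpty()) {
            return new StandaloneRecoveryFactory();
        }
        try {
            Class<? extends RecoveryFactory> clazz =
                    Class.forName(className.trim(), true, classLoader)
                            .asSubclass(RecoveryFactory.class);
            return clazz.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException | ClassCastException e) {
            // Fail fast on a bad setting rather than silently using the default.
            throw new IllegalArgumentException(
                    "Cannot instantiate recovery factory '" + className + "'", e);
        }
    }
}
```

Failing fast on an invalid class name is a deliberate choice in this sketch: silently falling back to the default would hide a misconfiguration until recovery time.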
We considered the pluggable StateBackend and the potentially pluggable
HighAvailabilityServices. Both were too convoluted for our problem, so we
would like to implement a custom CheckpointRecoveryFactory mechanism instead.
> Enable user-provided, custom CheckpointRecoveryFactory for non-HA deployments
> -----------------------------------------------------------------------------
>
> Key: FLINK-9114
> URL: https://issues.apache.org/jira/browse/FLINK-9114
> Project: Flink
> Issue Type: Improvement
> Components: Configuration, State Backends, Checkpointing
> Reporter: Jacob Park
> Assignee: Jacob Park
> Priority: Major