Hi All,

We are building a geo-redundant Flink deployment with an active primary and a 
passive secondary.

Requirement: The passive must be able to resume with exactly-once guarantees 
from the primary's latest completed checkpoint. Primary checkpoints are written 
to local NFS (fast but not 
remotely accessible), so we prototyped a custom CheckpointStorage plugin that 
synchronously writes each checkpoint to both local storage and a remote store 
the passive can access, and only marks a checkpoint complete once both copies 
are durably persisted.
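
To make the write path concrete, here is a minimal sketch of the "both copies 
or nothing" rule. It is an illustration only, not the POC: it uses plain 
java.nio instead of Flink's CheckpointStorage interfaces, the class and method 
names (DualWriteSketch, writeDurably, writeBoth) are hypothetical, and the 
"remote" store is assumed to be reachable as a file system path whose directory 
already exists.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper, not part of Flink.
class DualWriteSketch {

    // Writes all bytes to the target and forces them to the storage device
    // before returning. The parent directory is assumed to exist.
    static void writeDurably(Path target, byte[] data) throws IOException {
        try (FileChannel ch = FileChannel.open(target,
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            ByteBuffer buf = ByteBuffer.wrap(data);
            while (buf.hasRemaining()) {
                ch.write(buf);
            }
            ch.force(true); // flush file content and metadata
        }
    }

    // Returns true only if both copies were persisted. On any failure,
    // removes whatever partial copies exist and returns false, so the
    // caller can decline the checkpoint.
    static boolean writeBoth(Path localDir, Path remoteDir, String name, byte[] data) {
        Path local = localDir.resolve(name);
        Path remote = remoteDir.resolve(name);
        try {
            writeDurably(local, data);
            writeDurably(remote, data);
            return true;
        } catch (IOException e) {
            try {
                Files.deleteIfExists(local);
                Files.deleteIfExists(remote);
            } catch (IOException ignored) {
                // Best-effort cleanup; leftovers need a separate sweep.
            }
            return false;
        }
    }
}
```

The sketch writes the two copies one after the other for simplicity. Writing 
them in parallel would reduce added latency but does not change the rule that 
the checkpoint counts as complete only when both writes have succeeded.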

Tradeoffs: This design favors consistency over performance. We expect higher 
checkpoint latency, more network/storage cost, and extra complexity around 
partial replication and cleanup.

Looking for quick feedback on:

  *   Best practices / patterns for implementing dual synchronous checkpoint 
writes in Flink
  *   Reliable ways to atomically mark a checkpoint as "fully replicated" so 
the passive can safely restore
  *   Alternatives others use for geo-redundant exactly-once state 
(geo-replicated object store, external replication, savepoints, etc.)
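
As one concrete reading of the "fully replicated" marker in the second bullet, 
the sketch below writes a small marker file last and publishes it with an 
atomic rename, so the passive either sees a complete marker or none. This is an 
assumption-laden illustration: the marker name "_REPLICATED" and the class 
ReplicationMarkerSketch are hypothetical, and it assumes the remote store is a 
file system that supports atomic rename within a directory. On an object store 
without rename, the closest analogue would probably be a single small marker 
object written after all other objects.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of Flink.
class ReplicationMarkerSketch {

    // Hypothetical marker file name.
    static final String MARKER = "_REPLICATED";

    // Called by the primary after every checkpoint file has been durably
    // written to the remote directory. The marker is written under a
    // temporary name and then renamed atomically into place.
    static void markReplicated(Path checkpointDir, long checkpointId) throws IOException {
        Path tmp = checkpointDir.resolve(MARKER + ".tmp");
        Path marker = checkpointDir.resolve(MARKER);
        Files.write(tmp, Long.toString(checkpointId).getBytes(StandardCharsets.UTF_8));
        Files.move(tmp, marker, StandardCopyOption.ATOMIC_MOVE);
    }

    // Called by the passive before restoring: only directories that carry
    // the marker are considered restorable.
    static boolean isSafeToRestore(Path checkpointDir) {
        return Files.isRegularFile(checkpointDir.resolve(MARKER));
    }
}
```

With this pattern the passive would scan the remote checkpoint directories, 
ignore any without the marker, and restore from the newest one that has it. 
Directories left without a marker after a primary failure are the partial 
replicas that the cleanup step has to remove.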

We have a small POC and can share design/code if helpful. Thanks for any 
pointers!

Best Regards,
Mukul Gupta
