Hi All,

We are building a geo-redundant Flink deployment with an active primary and a passive secondary.

Requirement: the passive must be able to resume with exactly-once guarantees from the primary's latest checkpoint.

Primary checkpoints are written to local NFS (fast but not remotely accessible), so we prototyped a custom CheckpointStorage plugin that synchronously writes each checkpoint to both local storage and a remote store the passive can access. It marks a checkpoint complete only once both copies are durably persisted.

Tradeoffs: this favors consistency over performance. We expect higher checkpoint latency, more network/storage cost, and extra complexity around partial replication and cleanup.

Looking for quick feedback on:

* Best practices / patterns for implementing dual synchronous checkpoint writes in Flink
* Reliable ways to atomically mark a checkpoint as "fully replicated" so the passive can safely restore
* Alternatives others use for geo-redundant exactly-once state (geo-replicated object store, external replication, savepoints, etc.)

We have a small POC and can share design/code if helpful.

Thanks for any pointers!

Best Regards,
Mukul Gupta
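
P.S. To make the "fully replicated" question concrete, here is a minimal sketch of the marker pattern we have in mind. It is illustrative only and is not the POC code: it uses plain java.nio against two directories rather than Flink's CheckpointStorage interfaces, and the names DualWriteSketch, persist, isRestorable and the _REPLICATED marker file are all hypothetical. The chk-<id> directory naming is likewise just an assumption for the example. The idea is to write and fsync both copies first, then publish a marker in the remote store as the last step via an atomic rename, so the passive never sees a half-replicated checkpoint.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

final class DualWriteSketch {
    // Hypothetical marker name; its presence means "both copies are durable".
    static final String MARKER = "_REPLICATED";

    // Write all bytes and force them to stable storage before returning.
    static void writeDurably(Path file, byte[] data) throws IOException {
        Files.createDirectories(file.getParent());
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            ByteBuffer buf = ByteBuffer.wrap(data);
            while (buf.hasRemaining()) {
                ch.write(buf);
            }
            ch.force(true); // flush file content and metadata
        }
    }

    // Persist one checkpoint to both stores, then publish the marker last.
    // If any step throws, no marker exists and the checkpoint is not restorable.
    static void persist(Path localRoot, Path remoteRoot, long checkpointId, byte[] state)
            throws IOException {
        String dir = "chk-" + checkpointId;
        writeDurably(localRoot.resolve(dir).resolve("state"), state);

        Path remoteDir = remoteRoot.resolve(dir);
        writeDurably(remoteDir.resolve("state"), state);

        // Write the marker under a temporary name, then rename it atomically
        // so readers see either no marker or a complete one.
        Path tmp = remoteDir.resolve(MARKER + ".tmp");
        writeDurably(tmp, Long.toString(state.length).getBytes(StandardCharsets.UTF_8));
        Files.move(tmp, remoteDir.resolve(MARKER), StandardCopyOption.ATOMIC_MOVE);
    }

    // The passive side only considers checkpoints that carry the marker.
    static boolean isRestorable(Path remoteRoot, long checkpointId) {
        return Files.exists(remoteRoot.resolve("chk-" + checkpointId).resolve(MARKER));
    }
}
```

On the passive side, restore would pick the highest checkpoint id for which isRestorable returns true and treat anything without a marker as garbage to be cleaned up. The atomic rename assumes a filesystem-like remote store; on an object store without rename we would presumably have to rely on a single marker object being written last instead, which is part of what we would like feedback on.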
