Thank you for the proposal, Xinli! I have one concern about the "synchronous replication" part of the proposal. The proposal is to do the writes and metadata update locally and then replicate before returning from the commit() method. This introduces race conditions. For example, suppose the replication step fails and the commit aborts: any parallel transaction that read the local table metadata in the meantime will already have seen the commit, so from its point of view the commit did not actually abort. I think this could lead to correctness issues in applications and make the system harder to reason about. I wanted to raise this so that we think through this case.
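A minimal sketch of the race described above, in Python rather than Iceberg code. The `Table` class, its `commit` method, and the `replicate` callback are hypothetical names made up for illustration, and the rollback on failure is an assumption about what "commit aborts" would mean; the proposal may define it differently. The parallel reader is simulated deterministically inside the replication step instead of with a real thread.

```python
class ReplicationError(Exception):
    """Raised when the (simulated) copy to the remote target fails."""


class Table:
    """Toy stand-in for local table metadata (hypothetical, not Iceberg's API)."""

    def __init__(self):
        self.snapshots = [0]  # snapshot IDs visible to local readers

    def current_snapshot(self):
        return self.snapshots[-1]

    def commit(self, snapshot_id, replicate):
        # Step 1: update local metadata; the new snapshot is now visible.
        self.snapshots.append(snapshot_id)
        try:
            # Step 2: replicate synchronously before returning.
            replicate(self, snapshot_id)
        except ReplicationError:
            # Step 3 (assumed): replication failed, so roll back and abort.
            self.snapshots.remove(snapshot_id)
            return False
        return True


observed = []  # what the parallel reader saw


def failing_replication(table, snapshot_id):
    # A parallel reader looks at local metadata while replication is in flight...
    observed.append(table.current_snapshot())
    # ...and then the cross-datacenter copy fails.
    raise ReplicationError("target unreachable")


table = Table()
committed = table.commit(1, failing_replication)

# The commit reports an abort, yet the reader already saw snapshot 1.
print(committed, observed, table.current_snapshot())  # False [1] 0
```

The reader's view (snapshot 1) and the final table state (snapshot 0) disagree, which is the window the concern is about.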
-Jagdeep

On Wed, Oct 1, 2025 at 11:55 AM Maninder Parmar <[email protected]> wrote:

> Thanks for the proposal Xinli! I have few thoughts about this approach:
>
> 1. Doing commit time synchronous replication that involves copying data
> files will severely limit the transaction throughput as well as
> reliability. Even if we want to attempt synchronous replication, it would
> be better for file (both data and metadata) replication to be done by the
> engine.
> 2. In general, it might be performant/easier to design an asynchronous
> replication system that provides snapshot isolation guarantees for reads on
> the replica.
>
> Regards,
> Maninder
>
> On Wed, Oct 1, 2025 at 9:47 AM Manu Zhang <[email protected]> wrote:
>
>> Thanks for the proposal Xinli. I have also thought through Iceberg table
>> replication before and have some doubts over this approach.
>>
>> 1. Will synchronous replication be useful since underlying
>> distributed file systems like S3 already provide high durability? On the
>> other hand, a cross-datacenter network hiccup would fail the commit. It
>> might involve an oncall to disable the option for a commit to succeed if
>> the network issue lasts for a while. IMO, replication for disaster recovery
>> should be transparent and have no impact on users' applications.
>>
>> 2. How about commits from rewrite_data_files? Will it replicate the
>> entire table if all files of a table have been rewritten? In this case,
>> there's actually no "changes" to the table and I think only "changes" are
>> needed to replicate.
>>
>> 3. Metadata replication is not easy to get right. We've seen such issues
>> [1] with rewrite_table_path that not updating sizes in manifest lists could
>> lead to correctness problems. How about creating new metadata files for
>> replicated data files?
>>
>> [1] https://github.com/apache/iceberg/issues/13719
>>
>> Best,
>> Manu
>>
>> On Wed, Oct 1, 2025 at 6:43 AM Chao Sun <[email protected]> wrote:
>>
>>> Thanks for the proposal Xinli! It sounds very useful and I also just
>>> left some comments.
>>>
>>> On Mon, Sep 29, 2025 at 8:42 PM Gang Wu <[email protected]> wrote:
>>>
>>>> This thread was accidentally in my spam folder.
>>>>
>>>> I have left some comments with regard to the implication on the Iceberg
>>>> rest catalog side.
>>>>
>>>> Best,
>>>> Gang
>>>>
>>>> On Tue, Sep 30, 2025 at 5:44 AM Huaxin Gao <[email protected]> wrote:
>>>>
>>>>> Thanks for the proposal. I think it's in the right direction. I left
>>>>> some comments and will take another look when time allows.
>>>>>
>>>>> Huaxin
>>>>>
>>>>> On 2025/09/27 17:27:29 Xinli shang wrote:
>>>>> > Hi all,
>>>>> >
>>>>> > I’d like to propose adding *native incremental replication* to Iceberg
>>>>> > tables.
>>>>> >
>>>>> > *Motivation:* Many production deployments require cross–data center backup
>>>>> > and data locality. Today this is usually handled by external services,
>>>>> > which adds operational overhead and introduces failure modes outside
>>>>> > Iceberg’s transactional boundary. Integrating replication into the commit
>>>>> > workflow would simplify operations and improve consistency.
>>>>> >
>>>>> > *Proposal:* An optional replication phase in the commit process would
>>>>> > automatically copy data files and metadata to one or more targets (e.g.,
>>>>> > S3, HDFS, GCS, Azure). Replication is configured via table properties and
>>>>> > supports both synchronous (immediate consistency, higher latency) and
>>>>> > asynchronous (background retries, eventual consistency) modes. This
>>>>> > provides built-in disaster recovery, data locality optimization, and
>>>>> > cross-region analytics without external tool
>>>>> >
>>>>> > Full draft proposal with design details is here:
>>>>> > 👉 Incremental Iceberg Replication Proposal
>>>>> > <https://docs.google.com/document/d/1yrVLs0CQyIHs9WbBVx_EK6ad419Adsl9xHozpmQEMrs/edit?tab=t.0#heading=h.aa5ph23raz9l>
>>>>> >
>>>>> > Thanks,
>>>>> > Xinli
