Hi Iceberg community,

I'd like to propose adding a new `ROUND_ROBIN` value to the core `DistributionMode` enum and would appreciate the community's input.
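To make the proposal concrete, here is a minimal sketch of the shape of the change. It is a simplified stand-in, not the actual `org.apache.iceberg.DistributionMode` source: the `"round-robin"` mode name, the `UnsupportedModePolicy` enum, and the `resolve` helper are hypothetical names used only for illustration. The helper shows the two ways an engine without a round-robin concept might react to the new value, either by treating it as `NONE` or by failing explicitly.

```java
import java.util.Locale;

public class RoundRobinSketch {

  // Simplified stand-in for the core enum; NOT the real Iceberg source.
  public enum DistributionMode {
    NONE("none"),
    HASH("hash"),
    RANGE("range"),
    ROUND_ROBIN("round-robin"); // proposed value; the string name is an assumption

    private final String modeName;

    DistributionMode(String modeName) {
      this.modeName = modeName;
    }

    public String modeName() {
      return modeName;
    }

    // Case-insensitive lookup by configured name.
    public static DistributionMode fromName(String name) {
      String normalized = name.toLowerCase(Locale.ROOT);
      for (DistributionMode mode : values()) {
        if (mode.modeName.equals(normalized)) {
          return mode;
        }
      }
      throw new IllegalArgumentException("Invalid distribution mode: " + name);
    }
  }

  // Hypothetical policy for engines that have no round-robin concept.
  public enum UnsupportedModePolicy {
    ALIAS_TO_NONE,
    FAIL
  }

  // Hypothetical engine-side helper: decides what mode the engine actually uses.
  public static DistributionMode resolve(DistributionMode requested, UnsupportedModePolicy policy) {
    if (requested != DistributionMode.ROUND_ROBIN) {
      return requested; // existing modes are unaffected
    }
    switch (policy) {
      case ALIAS_TO_NONE:
        return DistributionMode.NONE;
      case FAIL:
        throw new UnsupportedOperationException(
            "Distribution mode " + requested.modeName() + " is not supported by this engine");
      default:
        throw new IllegalStateException("Unknown policy: " + policy);
    }
  }

  public static void main(String[] args) {
    DistributionMode requested = DistributionMode.fromName("round-robin");
    System.out.println(resolve(requested, UnsupportedModePolicy.ALIAS_TO_NONE));
  }
}
```

Any engine code that switches exhaustively over `DistributionMode` would need a branch like the one in `resolve`, which is why the new value affects integrations beyond Flink.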
### Background

In Flink's `DynamicIcebergSink`, `DistributionMode.NONE` currently distributes records across writer subtasks in a round-robin fashion. This is inconsistent with the regular `IcebergSink`, where `NONE` means records are forwarded directly to writers without any redistribution.

The round-robin behavior in `DynamicIcebergSink` forces records through a hash shuffle, which incurs serialization/deserialization (serde) and network cost, even when no meaningful distribution is needed. In high-throughput pipelines, these overheads can be a significant bottleneck.

By giving the round-robin behavior its own explicit mode (`ROUND_ROBIN`), we can make `NONE` consistently mean "no redistribution" across all Flink sinks. In `DynamicIcebergSink`, this enables operator chaining via a forward edge, eliminating the serde and shuffle overhead entirely.

### Why this is a core API change

The `DynamicRecord` API already uses the core `DistributionMode` enum to specify distribution per record. Distinguishing between "truly no distribution" (`NONE`) and "distribute evenly across writers" (`ROUND_ROBIN`) therefore requires adding the new value to the core enum, `org.apache.iceberg.DistributionMode`.

### Flink consensus

This has been discussed on PR #15433 [1] with the committers (Maximilian Michels and Peter Vary). There is agreement on the Flink side that:

- `NONE` should mean true forward/no-redistribution
- `ROUND_ROBIN` should take over the current round-robin behavior of `NONE` in `DynamicIcebergSink`

The discussion point for this thread is specifically about adding `ROUND_ROBIN` to the core API enum and its impact on other engine integrations.

### Handling in Spark

Spark's connector API does not have a round-robin distribution concept. The current PR maps `ROUND_ROBIN` to `Distributions.unspecified()`, the same as `NONE`, so that existing Spark tests pass. However, I'd appreciate the community's thoughts on which approach is preferred:

1. Treat as an alias for `NONE`: `ROUND_ROBIN` maps to `Distributions.unspecified()`, as the PR currently does. Pragmatic, but potentially confusing if users expect actual round-robin behavior.
2. Fail explicitly: Throw an error if a table is configured with `ROUND_ROBIN` in Spark, until a proper implementation (e.g. using Spark's `repartition()`) is added.

### Other engine writers

Other engines' Iceberg writer integrations will also need to handle the new enum value. Engine maintainers will need to consider whether `ROUND_ROBIN` has a meaningful distinct behavior in their context or whether it should be treated as an alias or an error.

Looking forward to the community's feedback.

[1] https://github.com/apache/iceberg/pull/15433

Best,
Han
