Hi Iceberg community,

I'd like to propose adding a new `ROUND_ROBIN` value to the core 
`DistributionMode` enum and would appreciate the community's input.

### Background

In Flink's `DynamicIcebergSink`, `DistributionMode.NONE` currently distributes 
records across writer subtasks in a round-robin fashion. This is inconsistent 
with the regular `IcebergSink`, where `NONE` means records are forwarded 
directly to writers without any redistribution. The round-robin behavior in 
`DynamicIcebergSink` forces records through a hash shuffle, which incurs serdes 
​ and network cost — even when no meaningful distribution is needed. In 
high-throughput pipelines, these overheads can be a significant bottleneck.

By giving the round-robin behavior its own explicit mode (`ROUND_ROBIN`), we 
can make `NONE` consistently mean "no redistribution" across all Flink sinks. 
In `DynamicIcebergSink`, this enables operator chaining via a forward edge, 
eliminating the serdes and shuffle overhead entirely.

### Why this is a core API change

The `DynamicRecord` API already uses the core `DistributionMode` enum to 
specify distribution per-record. Adding a new mode to distinguish between 
"truly no distribution" (`NONE`) and "distribute evenly across writers" 
(`ROUND_ROBIN`) requires adding the new value to the core enum in 
`org.apache.iceberg.DistributionMode`.

### Flink consensus

This has been discussed on PR #15433 [1] with the committers (Maximilian 
Michels and Peter Vary). There is agreement on the Flink side that:
- `NONE` should mean true forward/no-redistribution
- `ROUND_ROBIN` should take over the current round-robin behavior of `NONE` in 
`DynamicIcebergSink`

The discussion point for this thread is specifically about adding `ROUND_ROBIN` 
to the core API enum and its impact on other engine integrations.

### Handling in Spark

Spark's connector API does not have a round-robin distribution concept. The 
current PR maps `ROUND_ROBIN` to `Distributions.unspecified()`, the same as 
`NONE`, so that existing Spark tests pass. However, I'd appreciate the 
community's thoughts on which approach is preferred:

1. Treat as alias for NONE: `ROUND_ROBIN` maps to `Distributions.unspecified()` 
like currently in the PR. Pragmatic, but potentially confusing if users expect 
actual round-robin behavior.
2. Fail explicitly: Throw an error if a table is configured with `ROUND_ROBIN` 
in Spark, until a proper implementation (e.g. using Spark's `repartition()`) is 
added.

### Other engine writers

Other engines’ Iceberg writer integrations will also need to handle the new 
enum value. Engine maintainers will need to consider whether `ROUND_ROBIN` has 
a meaningful distinct behavior in their context or whether it should be treated 
as an alias/error.

Looking forward to the community's feedback.

[1] https://github.com/apache/iceberg/pull/15433

Best,
Han

________________________________

The information in this e-mail is intended only for the person or entity to 
which it is addressed.

It may contain confidential and /or privileged material, the disclosure of 
which is prohibited. Any unauthorized copying, disclosure or distribution of 
the information in this email outside your company is strictly forbidden.

If you are not the intended recipient (or have received this email in error), 
please contact the sender immediately and permanently delete all copies of this 
email and any attachments from your computer system and destroy any hard 
copies. Although the information in this email has been compiled with great 
care, neither IMC nor any of its related entities shall accept any 
responsibility for any errors, omissions or other inaccuracies in this 
information or for the consequences thereof, nor shall it be bound in any way 
by the contents of this e-mail or its attachments.

Messages and attachments are scanned for all known viruses. Always scan 
attachments before opening them.

Reply via email to