Re: [PR] Flink: Add passthroughRecords option to DynamicIcebergSink [iceberg]

via GitHub Wed, 25 Feb 2026 10:42:59 -0800


sqd commented on PR #15433:
URL: https://github.com/apache/iceberg/pull/15433#issuecomment-3961254603


   > That said, the potential performance improvements need to outweigh the 
slight increase in complexity
   
   I actually have some numbers! Before the change, the pipeline took around 
1~1.5 TB of memory and ~200 cores. With the change, it shaved off 50~70 cores 
(not to mention the increased throughput). Of course there is other computation 
going on as well, but Parquet writing and Flink RowData serdes showed up in the 
profiler as taking >90% of CPU combined. Serdes alone was taking around 75% of 
the CPU that the actual Parquet writing took.
   
   > Could you share a bit more about your use case
   
   My use case is that I have a firehose of data that I want to ingest into 
Iceberg. Because the volume is so high, it doesn't really matter which writer 
subtask a record is routed to; there won't be small files either way. I was 
running DistributionMode.NONE and noticed that serdes was taking up a 
ridiculous amount of resources and also causing a lot of unnecessary network 
shuffling.
   
   > adding a new DistributionMode
   
   I am a big fan of calling it ROUND_ROBIN instead, but are we not worried 
about breaking existing code? Maybe introduce ROUND_ROBIN as an alias for NONE, 
and this new mode can be called "PASSTHROUGH" or something?
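
   To make the alias idea concrete, here is a rough sketch of what I mean. It 
is a standalone stand-in enum, not the real DistributionMode, and 
SketchDistributionMode, the "round_robin" alias, and the "passthrough" name are 
all hypothetical. It assumes the mode is looked up from a string name:

```java
import java.util.Locale;

// Hypothetical stand-in for illustration only; this is NOT Iceberg's DistributionMode.
enum SketchDistributionMode {
  NONE("none"),
  HASH("hash"),
  RANGE("range"),
  PASSTHROUGH("passthrough"); // hypothetical new mode: skip serdes and shuffle

  private final String modeName;

  SketchDistributionMode(String modeName) {
    this.modeName = modeName;
  }

  String modeName() {
    return modeName;
  }

  // Resolves a configured name to a mode. "round_robin" is accepted as an
  // alias for NONE, so existing configs that say "none" keep working.
  static SketchDistributionMode fromName(String name) {
    String normalized = name.trim().toLowerCase(Locale.ROOT);
    if ("round_robin".equals(normalized)) {
      return NONE;
    }
    for (SketchDistributionMode mode : values()) {
      if (mode.modeName.equals(normalized)) {
        return mode;
      }
    }
    throw new IllegalArgumentException("Unknown distribution mode: " + name);
  }
}
```

   With this shape, nothing that already references NONE has to change, and 
the new behavior is strictly opt-in.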


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

