sqd commented on PR #15433:
URL: https://github.com/apache/iceberg/pull/15433#issuecomment-3961254603

> That said, the potential performance improvements need to outweigh the slight increase in complexity

I actually have some numbers! Before the change, the pipeline took around 1~1.5 TB of memory and ~200 cores. With the change, it shaved off 50~70 cores (not to mention the increased throughput). Of course there is other computation going on as well, but Parquet writing and Flink RowData serdes showed up in the profiler as taking >90% of CPU combined. Serdes alone was using around 75% of the CPU that the actual Parquet writing used.

> Could you share a bit more about your use case

My use case is that I have a firehose of data that I want to ingest into Iceberg. Because the volume is so high, it doesn't really matter which writer subtask a record is routed to; there won't be small files either way. I was running DistributionMode.NONE and noticed that serdes was taking up a ridiculous amount of resources and also causing a lot of unnecessary network shuffling.

> adding a new DistributionMode

I am a big fan of calling it ROUND_ROBIN instead, but are we not worried about breaking existing code? Maybe introduce ROUND_ROBIN as an alias for NONE, and this new mode can be called "PASSTHROUGH" or something?
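
To make the aliasing idea concrete, here is a minimal, language-neutral sketch in Python rather than Iceberg's actual Java enum. The class below is a hypothetical stand-in: `ROUND_ROBIN` as an alias, the `PASSTHROUGH` member, and the `from_name` helper are all assumptions for illustration, and the new mode's name is not settled. It only shows that an alias can resolve to the existing member, so code that already uses NONE keeps working.

```python
from enum import Enum


class DistributionMode(Enum):
    """Hypothetical stand-in for a distribution-mode enum (not Iceberg's Java class)."""

    NONE = "none"
    HASH = "hash"
    RANGE = "range"
    # Same value as NONE, so Python treats this as an alias of NONE,
    # not as a separate member. Existing uses of NONE are unaffected.
    ROUND_ROBIN = "none"
    # Hypothetical new mode suggested in the comment; the name is an assumption.
    PASSTHROUGH = "passthrough"

    @classmethod
    def from_name(cls, name: str) -> "DistributionMode":
        # Case-insensitive lookup by member name; aliases resolve
        # to their canonical member.
        return cls[name.upper()]
```

With this shape, `DistributionMode.from_name("round_robin")` and `DistributionMode.from_name("none")` return the same member, while the new mode stays distinct. Whether the real enum can carry an alias like this without affecting serialized configuration is a separate question this sketch does not answer.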
