[GitHub] [iceberg] yegangy0718 commented on issue #7393: The serialization problem caused by Flink shuffling design

via GitHub Sun, 23 Apr 2023 22:09:37 -0700


yegangy0718 commented on issue #7393:
URL: https://github.com/apache/iceberg/issues/7393#issuecomment-1519392273


   Hi @huyuanfeng2018, thanks for showing interest in the project.
   
   We do plan to add a custom serializer for `DataStatisticsOrRecord`, as 
@stevenzwu commented at 
https://github.com/apache/iceberg/pull/7269#discussion_r1157718810.
   
   We have run a perf test with the internal PoC implementation. The results 
are published at 
https://www.slideshare.net/FlinkForward/tame-the-small-files-problem-and-optimize-data-layout-for-streaming-ingestion-to-iceberg
 (slide 44 to the end). We observed CPU usage increase from 35% to 57% 
for the simplest streaming job (consume from Kafka and write to Iceberg) after 
applying shuffling. This is expected, since we trade more CPU usage for better 
file sizes and data clustering.
   
   To analyze the perf impact you are seeing, we may need more information 
about the test cases you ran, such as the Flink DAG structure and the data 
distribution.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


