[GitHub] [iceberg] huyuanfeng2018 commented on issue #7393: The serialization problem caused by Flink shuffling design

via GitHub Mon, 24 Apr 2023 19:45:27 -0700


huyuanfeng2018 commented on issue #7393:
URL: https://github.com/apache/iceberg/issues/7393#issuecomment-1521084658


   1. Regarding the 4x slowdown, this is indeed a four-fold decrease in 
throughput. As a user, I am more concerned about the throughput under the same 
resources, so this does not mean that the CPU consumption is increased 4x.
   
   2. Regarding the custom range partitioner, I implemented a custom range 
partitioner myself. As described in my previous reply, we will specify the 
proportion of each enumeration partition to the total data volume before the 
task starts (distribution-balance-column- ratio), and then use the custom range 
partitioner to calculate which streamWrite subtasks the data of each 
enumeration type is distributed to. According to #6382 and #7269, I then 
realized the dynamic adjustment of the total data of each enumeration partition 
based on each checkpoint statistics The proportion of 
volume(distribution-balance-column- ratio), so as to achieve the purpose of 
dynamic allocation, from the metric on the Flink ui, my data shuffling is very 
uniform, from the monitoring point of view, the cpu occupancy rate of all my tm 
is 100%, the following two pictures These are some of my monitoring data at 
this time
   <img width="623" alt="image" 
src="https://user-images.githubusercontent.com/40817998/234160124-aeb88a2d-953c-45be-a299-fc5fb4189c78.png";>
   
   <img width="1093" alt="image" 
src="https://user-images.githubusercontent.com/40817998/234160387-baeaddcd-1697-4b5d-9d0c-22ba8a915257.png";>
   3. In fact, when I doubled the parallelism, I found that my processing 
efficiency also doubled like this, so it can be determined with a high 
probability that it is the bottleneck of the cpu rather than the network io:
   <img width="1364" alt="image" 
src="https://user-images.githubusercontent.com/40817998/234161033-47b26d7d-81a0-4205-948c-f0544d32c203.png";>
   
   4. In fact, there is one thing I have doubts about. Why doesn't the 
DataStatisticsOrRecord object directly implement Flink Rowdata? Because the 
generic type here has been determined to be Rowdata when using Flink to write
   ```
   public class RowDataTaskWriterFactory implements TaskWriterFactory<RowData>
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [iceberg] huyuanfeng2018 commented on issue #7393: The serialization problem caused by Flink shuffling design

Reply via email to