[PR] [VL] Optimize the performance of hash based shuffle by accumulating batches [incubator-gluten]

via GitHub Sun, 02 Jun 2024 01:57:30 -0700


XinShuoWang opened a new pull request, #5951:
URL: https://github.com/apache/incubator-gluten/pull/5951


   ## What changes were proposed in this pull request?
   I profiled the benchmark with perf and found that the most time-consuming 
functions were `splitFixedWidthValueBuffer` and `splitBinaryType`. However, 
other columnar engines (such as StarRocks) implement `split` the same way, 
trading random memory reads for sequential memory writes, so I think there is 
not much room left to optimize the `split` function itself.
   
   <img width="1213" alt="Screenshot 2024-06-02 16 39 30" 
src="https://github.com/apache/incubator-gluten/assets/56379080/524dd7d2-1e91-4ccb-a928-23626a331736">
   
   I found that increasing ShuffleBatchSize improves performance 
significantly. I think the gains come mainly from the following:
   
   1. It lets the split stage actually benefit from sequential memory writes. 
When PartitionNum is 10000 and ShuffleBatchSize is 4096 (the default value in 
the benchmark), each partition receives at most 1 row per batch (measured by 
logging in the benchmark). With so few rows per partition, the writes are 
effectively scattered and sequential writing gives no advantage.
   
   2. It reduces the number of function calls and the number of memory 
allocations.
   
   Therefore, this PR accumulates the batches to be shuffled before splitting 
them, which improves ShuffleWrite performance. For the measured results, 
please refer to the screenshots in the test section below.
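
A minimal sketch of the accumulation idea is shown here. It is not the code in this PR: the class `AccumulatingSplitter`, the single-column `Batch` type, the modulo partitioner, and the `minRowsToSplit` threshold are all hypothetical names chosen for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a batch holding one fixed-width column.
using Batch = std::vector<std::int64_t>;

class AccumulatingSplitter {
 public:
  // Assumes numPartitions > 0.
  AccumulatingSplitter(std::size_t numPartitions, std::size_t minRowsToSplit)
      : partitions_(numPartitions), minRowsToSplit_(minRowsToSplit) {}

  // Buffer the incoming batch; split only once enough rows have piled up.
  void add(const Batch& batch) {
    pending_.insert(pending_.end(), batch.begin(), batch.end());
    if (pending_.size() >= minRowsToSplit_) {
      flush();
    }
  }

  // Split whatever is buffered. Call once more at end of input.
  void flush() {
    if (pending_.empty()) {
      return;
    }
    for (std::int64_t value : pending_) {
      // Toy partitioner: modulo of the value reinterpreted as unsigned.
      const std::size_t pid = static_cast<std::size_t>(
          static_cast<std::uint64_t>(value) % partitions_.size());
      partitions_[pid].push_back(value);
    }
    pending_.clear();
    ++splitCalls_;
  }

  std::size_t splitCalls() const { return splitCalls_; }
  const std::vector<Batch>& partitions() const { return partitions_; }

 private:
  std::vector<Batch> partitions_;
  std::size_t minRowsToSplit_;
  Batch pending_;
  std::size_t splitCalls_ = 0;
};
```

In this sketch, two 4-row batches fed to a splitter with `minRowsToSplit = 8` result in a single split pass instead of two, so each partition buffer is appended to in a longer run and the per-call overhead is paid once.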
   
   I think this approach could also decide whether to accumulate data based 
on current memory usage, which would help avoid the ShuffleWrite OOM problem.
   
   
   ## How was this patch tested?
   ### Command
   ```
   ./build/velox/benchmarks/shuffle_split_benchmark \
     --file=/root/shuffleSplitBenchmark/cpp/velox/benchmarks/data/tpch_sf10m/lineitem/part-00000-6c374e0a-7d76-401b-8458-a8e31f8ab704-c000.snappy.parquet \
     --partitions=10000 --iterations=100 --threads=1
   ```
   
   ### Before optimization
   <img width="1775" alt="Screenshot 2024-06-02 16 13 43" 
src="https://github.com/apache/incubator-gluten/assets/56379080/95a3d7a5-021a-4daa-abfe-aa9c96268077">
   
   ### After optimization
   <img width="1778" alt="Screenshot 2024-06-02 16 18 28" 
src="https://github.com/apache/incubator-gluten/assets/56379080/a989d3c3-3aa8-4b6c-9159-844f73044d33">
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


