XinShuoWang opened a new pull request, #5951: URL: https://github.com/apache/incubator-gluten/pull/5951
## What changes were proposed in this pull request?

I profiled the benchmark with perf and found that the most time-consuming functions were `splitFixedWidthValueBuffer` and `splitBinaryType`. However, current columnar compute engines (such as StarRocks) implement their `split` function with the same idea of trading random memory reads for sequential memory writes, so I think there is not much room left to optimize the `split` function itself.

<img width="1213" alt="Screenshot 2024-06-02 16 39 30" src="https://github.com/apache/incubator-gluten/assets/56379080/524dd7d2-1e91-4ccb-a928-23626a331736">

I found that increasing ShuffleBatchSize improves performance significantly. I think the benefit comes mainly from two things:

1. It lets the split stage take full advantage of sequential memory writes. When PartitionNum is 10000 and ShuffleBatchSize is 4096 (the default value in the benchmark), each partition receives at most 1 row per batch (measured by logging statistics in the benchmark). With so few rows per partition, sequential writes bring no advantage.
2. It reduces the number of function calls and the number of memory allocations.

Therefore, this PR caches the data to be shuffled, which improves ShuffleWrite performance. For the specific test data, please refer to the screenshots below. I think this approach could also decide whether to cache data based on memory usage, which would help avoid the ShuffleWrite OOM problem.

## How was this patch tested?
### Command

```
./build/velox/benchmarks/shuffle_split_benchmark --file=/root/shuffleSplitBenchmark/cpp/velox/benchmarks/data/tpch_sf10m/lineitem/part-00000-6c374e0a-7d76-401b-8458-a8e31f8ab704-c000.snappy.parquet --partitions=10000 --iterations=100 --threads=1
```

### Before optimization

<img width="1775" alt="Screenshot 2024-06-02 16 13 43" src="https://github.com/apache/incubator-gluten/assets/56379080/95a3d7a5-021a-4daa-abfe-aa9c96268077">

### After optimization

<img width="1778" alt="Screenshot 2024-06-02 16 18 28" src="https://github.com/apache/incubator-gluten/assets/56379080/a989d3c3-3aa8-4b6c-9159-844f73044d33">
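
As a rough illustration of the batch-size reasoning in the description, the sketch below computes the average rows per partition for one batch and shows a minimal accumulator that buffers rows before handing them to a split step. It is not the code in this PR: `avgRowsPerPartition` and `BatchAccumulator` are hypothetical names, rows are modeled as plain `int64_t` values, and the average assumes rows are spread uniformly across partitions (the per-partition figure quoted in the description came from logging, not from this formula).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Average number of rows each partition receives from one input batch.
// Assumption: rows are distributed uniformly across partitions.
double avgRowsPerPartition(std::size_t batchRows, std::size_t numPartitions) {
  return static_cast<double>(batchRows) / static_cast<double>(numPartitions);
}

// Hypothetical accumulator (not a Gluten class): buffers incoming rows until
// at least `targetRows` rows are available, so that a later split step can
// write longer sequential runs per partition.
class BatchAccumulator {
 public:
  explicit BatchAccumulator(std::size_t targetRows) : targetRows_(targetRows) {}

  // Appends one batch. Returns true once enough rows are buffered to split.
  bool add(const std::vector<int64_t>& batch) {
    buffered_.insert(buffered_.end(), batch.begin(), batch.end());
    return buffered_.size() >= targetRows_;
  }

  // Hands back all buffered rows and leaves the accumulator empty.
  std::vector<int64_t> drain() {
    std::vector<int64_t> out;
    out.swap(buffered_);
    return out;
  }

  std::size_t bufferedRows() const { return buffered_.size(); }

 private:
  std::size_t targetRows_;
  std::vector<int64_t> buffered_;
};
```

With the benchmark's numbers, `avgRowsPerPartition(4096, 10000)` is 0.4096, so most partitions get zero or one row per batch. Buffering ten such batches before splitting raises the average to about 4 rows per partition, at the cost of holding more data in memory, which is why tying the caching decision to memory usage matters.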
