wForget commented on PR #12259:
URL: https://github.com/apache/gluten/pull/12259#issuecomment-4656030239

   Thanks @Xtpacz. The current implementation merges the build side into a 
single `ColumnBatch`, which does not seem to be a general solution.
   
   > Verified on an internal Spark cluster running TPC-DS 5TB. On q64, with all 
other configs equal, the aggregated HashBuild total time dropped from 2.04h to 
22.1min when mergeBatches=true (~5.5x reduction).
   
   That said, based on the benchmark results you shared, the performance 
improvement is quite significant. Could you share more details about the 
previous bottleneck? Was the main issue caused by generating a large number of 
small `ColumnBatch` instances, or was there another factor contributing to the 
overhead?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

