wForget commented on PR #12259: URL: https://github.com/apache/gluten/pull/12259#issuecomment-4656030239
Thanks @Xtpacz. The current implementation merges the build side into a single `ColumnBatch`, which does not seem to be a general solution.

> Verified on an internal Spark cluster running TPC-DS 5TB. On q64, with all other configs equal, the aggregated HashBuild total time dropped from 2.04h to 22.1min when mergeBatches=true (~5.5x reduction).

That said, based on the benchmark results you shared, the performance improvement is quite significant. Could you share more details about the previous bottleneck? Was the main issue caused by generating a large number of small `ColumnBatch` instances, or was there another factor contributing to the overhead?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
