Xtpacz commented on PR #12259: URL: https://github.com/apache/gluten/pull/12259#issuecomment-4660820747
> Thanks @Xtpacz. The current implementation merges the build side into a single `ColumnBatch`, which does not seem to be a general solution.
>
> > Verified on an internal Spark cluster running TPC-DS 5TB. On q64, with all other configs equal, the aggregated HashBuild total time dropped from 2.04h to 22.1min when mergeBatches=true (~5.5x reduction).
>
> That said, based on the benchmark results you shared, the performance improvement is quite significant. Could you share more details about the previous bottleneck? Was the main issue caused by generating a large number of small `ColumnBatch` instances, or was there another factor contributing to the overhead?

@wForget Thanks for the review!

**Root cause:** The per-batch serialize/deserialize overhead is incurred across the full pipeline. In our q64 case (19.6B build rows, maxBatchSize=4096), this produces about 480K independent buffers. Each one goes through PrestoSerializer creation + ArrowBuf allocation on serialize, and PrestoVectorSerde.deserialize() + small-vector HashBuild on the executor side. The executor-side cost dominates: small vectors (4096 rows) have poor vectorization efficiency and cache locality for hash table building.

**On generality:** This is not a new optimization. Gluten 1.2 used this exact merged serialization path. PR #9521 switched to per-batch serialization to reduce the native memory peak, which inadvertently caused this regression. Our patch simply restores the 1.2 behavior behind a config switch (default=false), keeping #9521's OOM-safe path as the default.

If the team prefers a middle ground, we can do a hybrid: merge in groups of N batches to amortize the overhead without holding the full partition in native memory. Happy to implement if preferred.
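
For illustration, the hybrid idea (merging in groups of N batches) could look roughly like the sketch below. This is a minimal Python sketch, not Gluten code: `merge_in_groups`, `group_size`, and the representation of a batch as a plain sequence of rows are all hypothetical names chosen for the example, and a real implementation would operate on native column batches and would probably bound groups by bytes rather than by batch count.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Sequence, TypeVar

Row = TypeVar("Row")


def merge_in_groups(
    batches: Iterable[Sequence[Row]], group_size: int
) -> Iterator[List[Row]]:
    """Yield merged batches, each built from up to `group_size` input batches.

    Hypothetical helper: a batch is modelled as a sequence of rows.
    Only one group is held in memory at a time, so peak memory is
    bounded by roughly `group_size` input batches.
    """
    if group_size < 1:
        raise ValueError("group_size must be >= 1")
    it = iter(batches)
    while True:
        # Pull the next group lazily; the rest of the input stays unread.
        group = list(islice(it, group_size))
        if not group:
            return
        merged: List[Row] = []
        for batch in group:
            merged.extend(batch)  # preserves row order within the group
        yield merged
```

With `group_size=1` this degenerates to the per-batch path, and with a group size at least as large as the number of batches it degenerates to the fully merged path, so a single knob would cover both existing behaviors.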
