Xtpacz opened a new issue, #12251: URL: https://github.com/apache/gluten/issues/12251
### Description ## Summary We observed a ~15x performance regression in BroadcastHashJoin's HashBuild step when running TPC-DS 5TB on Gluten 1.6 vs Gluten 1.2 (Velox backend). ## Reproduction - Workload: TPC-DS 5TB, Q64 - Same cluster, same configurations on both versions | Version | time of hash build total | |---------|--------------------------| | Gluten 1.2 | 8.0 min | | Gluten 1.6 | 2.04 hours (~15x slower) | ### Spark configuration --num-executors 500 spark.executor.cores = 4 spark.executor.memory = 2g spark.executor.memoryOverhead = 2g spark.memory.offHeap.enabled = true spark.memory.offHeap.size = 8g spark.driver.cores = 4 spark.driver.memory = 10g spark.driver.maxResultSize = 3g spark.sql.autoBroadcastJoinThreshold = 128MB spark.sql.adaptive.autoBroadcastJoinThreshold = 128MB spark.sql.shuffle.partitions = 760 spark.io.compression.codec = lz4 spark.serializer = org.apache.spark.serializer.KryoSerializer spark.kryoserializer.buffer.max = 2000m spark.plugins = org.apache.gluten.GlutenPlugin spark.sql.extensions = org.apache.spark.sql.gluten.extension.GlutenExtension spark.gluten.sql.columnar.backend.lib = velox spark.shuffle.manager = org.apache.spark.shuffle.sort.ColumnarShuffleManager spark.gluten.enabled = true spark.gluten.sql.columnar.shuffle.enabled = true ### Spark UI screenshots gluten1.6 <img width="420" height="683" alt="Image" src="https://github.com/user-attachments/assets/83956494-917b-4f94-bd44-526f0769ed27" /> gluten1.2 <img width="452" height="612" alt="Image" src="https://github.com/user-attachments/assets/02b617a2-174f-416f-9a32-8cd80d815bb7" /> ## Expected behavior HashBuild time on 1.6 should be on par with 1.2 for the same workload. At a minimum, it shouldn't be this much slower. ## Actual behavior HashBuild time degrades ~15x with no change in input rows. ## Root cause Bisected to PR #9521 (commit 3e9989da1): [GLUTEN-9475][VL] Serialize ColumnarBatch one by one to reduce memory footprint when broadcasting Behavior change: | | 1.2 | 1.6 | |---|---|---| | Driver-side JNI calls during broadcast serialize | 1 | N | | Broadcast payload (`Array[Array[Byte]]`) length | 1 | N | | Executor `deserialize` calls | 1 | N | | ColumnarBatches fed to HashBuild | 1 (large) | N (small) | | `HashBuild.addInput` calls | 1 | N | `HashBuild::addInput` has fixed per-call overhead independent of row count (memory reservation check, DecodedVector init, NULL/stats locking, spill check). For Q64 with ~2B rows spread across many batches, this fixed cost gets multiplied by N and dominates total wall time. ## Tradeoff PR #9521 was introduced to reduce driver-side native memory peak for huge broadcasts (issue #9475). That tradeoff is valid for very large broadcasts that would OOM on 1.2, but for typical workloads it introduces a heavy CPU cost. The right fix is **not** to revert #9521 — both behaviors have valid use cases. ## Proposed fix: keep both paths behind a config switch Add `spark.gluten.velox.broadcastBuild.mergeBatches` (boolean, default `false`): - `false` (default): keep 1.6's per-batch serialization (preserves OOM fix from #9521) - `true`: collect all batch handles, issue **one** JNI call to a new `serializeAll(long[])`, producing 1 combined buffer (matches 1.2 behavior) The new JNI is **additive** — existing `serialize(long)` is preserved, so `ColumnarCachedBatchSerializer` and other callers are unaffected. I have a working patch. On the same Q64 reproduction: - `mergeBatches=false` → HashBuild ~2.04 h (1.6 baseline) - `mergeBatches=true` → HashBuild ~22 min (matches 1.2) when I set mergeBatches=true: <img width="422" height="681" alt="Image" src="https://github.com/user-attachments/assets/57aa8a63-2821-4db5-9655-eb9bd71773d9" /> Both paths produce semantically equivalent broadcast results. Happy to open a PR for review. ### Gluten version None -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
