Xtpacz opened a new issue, #12251:
URL: https://github.com/apache/gluten/issues/12251

   ### Description
   
   ## Summary
   We observed a ~15x performance regression in BroadcastHashJoin's HashBuild 
step when running TPC-DS 5TB on Gluten 1.6 vs Gluten 1.2 (Velox backend).
   
   
   ## Reproduction
   - Workload: TPC-DS 5TB, Q64
   - Same cluster, same configurations on both versions
   
   | Version | time of hash build total |
   |---------|--------------------------|
   | Gluten 1.2 | 8.0 min |
   | Gluten 1.6 | 2.04 hours (~15x slower) |
   
   ### Spark configuration
   
   --num-executors 500
   spark.executor.cores = 4
   spark.executor.memory = 2g
   spark.executor.memoryOverhead = 2g
   spark.memory.offHeap.enabled = true
   spark.memory.offHeap.size = 8g
   spark.driver.cores = 4
   spark.driver.memory = 10g
   spark.driver.maxResultSize = 3g
   
   spark.sql.autoBroadcastJoinThreshold = 128MB
   spark.sql.adaptive.autoBroadcastJoinThreshold = 128MB
   spark.sql.shuffle.partitions = 760
   spark.io.compression.codec = lz4
   spark.serializer = org.apache.spark.serializer.KryoSerializer
   spark.kryoserializer.buffer.max = 2000m
   
   spark.plugins = org.apache.gluten.GlutenPlugin
   spark.sql.extensions = org.apache.spark.sql.gluten.extension.GlutenExtension
   spark.gluten.sql.columnar.backend.lib = velox
   spark.shuffle.manager = org.apache.spark.shuffle.sort.ColumnarShuffleManager
   spark.gluten.enabled = true
   spark.gluten.sql.columnar.shuffle.enabled = true
   
   ### Spark UI screenshots
   gluten1.6
   <img width="420" height="683" alt="Image" 
src="https://github.com/user-attachments/assets/83956494-917b-4f94-bd44-526f0769ed27";
 />
   
   gluten1.2
   <img width="452" height="612" alt="Image" 
src="https://github.com/user-attachments/assets/02b617a2-174f-416f-9a32-8cd80d815bb7";
 />
   
   
   
   ## Expected behavior
   HashBuild time on 1.6 should be on par with 1.2 for the same workload. At a 
minimum, it shouldn't be this much slower.
   
   ## Actual behavior
   HashBuild time degrades ~15x with no change in input rows.
   
   
     ## Root cause
     Bisected to PR #9521 (commit 3e9989da1): [GLUTEN-9475][VL] Serialize 
ColumnarBatch one by one to reduce memory footprint when broadcasting
   
     Behavior change:
     | | 1.2 | 1.6 |
     |---|---|---|
     | Driver-side JNI calls during broadcast serialize | 1 | N |
     | Broadcast payload (`Array[Array[Byte]]`) length | 1 | N |
     | Executor `deserialize` calls | 1 | N |
     | ColumnarBatches fed to HashBuild | 1 (large) | N (small) |
     | `HashBuild.addInput` calls | 1 | N |
   
   `HashBuild::addInput` has fixed per-call overhead independent of row count 
(memory reservation check, DecodedVector init, NULL/stats locking, spill 
check). For Q64 with ~2B rows spread across many batches, this fixed cost gets 
multiplied by N and dominates total wall time.
   
   ## Tradeoff
   PR #9521 was introduced to reduce driver-side native memory peak for huge 
broadcasts (issue #9475). That tradeoff is valid for very large broadcasts that 
would OOM on 1.2, but for typical workloads it introduces a heavy CPU cost. The 
right fix is **not** to revert #9521 — both
   behaviors have valid use cases.
   
   ## Proposed fix: keep both paths behind a config switch
   Add `spark.gluten.velox.broadcastBuild.mergeBatches` (boolean, default 
`false`):
   - `false` (default): keep 1.6's per-batch serialization (preserves OOM fix 
from #9521)
   - `true`: collect all batch handles, issue **one** JNI call to a new 
`serializeAll(long[])`, producing 1 combined buffer (matches 1.2 behavior)
   
   The new JNI is **additive** — existing `serialize(long)` is preserved, so 
`ColumnarCachedBatchSerializer` and other callers are unaffected.
   
   I have a working patch. On the same Q64 reproduction:
   - `mergeBatches=false` → HashBuild ~2.04 h (1.6 baseline)
   - `mergeBatches=true`  → HashBuild ~22 min (matches 1.2)
   when I set mergeBatches=true:
   <img width="422" height="681" alt="Image" 
src="https://github.com/user-attachments/assets/57aa8a63-2821-4db5-9655-eb9bd71773d9";
 />
   
   Both paths produce semantically equivalent broadcast results. Happy to open 
a PR for review.
   
   
   
   
   ### Gluten version
   
   None


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to