yjshen opened a new issue, #11070:
URL: https://github.com/apache/incubator-gluten/issues/11070

   ### Backend
   
   VL (Velox)
   
   ### Bug description
   
   ## Problem Description
   
   We encountered a task failure when running the same job on Gluten that 
succeeds on vanilla Spark. The Gluten task fails with:
   
   ```
   [2025-11-10 17:59:35.963] Container exited with a non-zero exit code 137.
   [2025-11-10 17:59:35.965] Killed by external signal
   ```
   
   Exit code 137 typically indicates the process was killed by the OS (OOM 
killer).
   
   ## Questions
   
   1. **Why is Gluten's spill memory ~64GB while Spark's is ~4GB?**
      - Is this a metrics reporting issue, or is Gluten actually spilling 
significantly more data to memory?
      - The 16x difference seems unusually large for the same workload.
   
   2. **Is this a Sort/Spill issue rather than a Window function issue?**
      - Based on the operator sequence, both implementations perform Sort 
before Window.
      - We suspect the issue is more related to the sort and spilling mechanism 
rather than the window function itself.  #10500
      - Is this assessment correct?
   
   3. **What are the potential root causes and workarounds?**
      - What configuration tuning would you recommend?
      - Are there known issues with Gluten's spilling mechanism under memory 
pressure?
      - Should we adjust the off-heap vs on-heap memory allocation or other 
shuffle or spill-related configurations?
   
   
   ## Key Observations
   
   ### Spill Memory Discrepancy
   
   The most significant difference we observed is in spill memory metrics:
   
   - **Gluten**: Spill (Memory) ~64GB per task
   - **Vanilla Spark**: Spill (Memory) ~4GB per task
   
   This represents approximately **16x higher memory spilling** in Gluten 
compared to vanilla Spark for the same workload.
   
   ## Configuration
   
   ### Spark Configuration
   ```
   Executor Reqs:
       memoryOverhead: 8192 MB
       cores: 15
       memory: 81920 MB
       offHeap: 0 MB
   
   Task Reqs:
       cpus: 1.0
   
   spark.sql.shuffle.partitions: 200 (default)
   ```
   
   ### Gluten Configuration
   ```
   Executor Reqs:
       memoryOverhead: 8192 MB
       cores: 15
       memory: 20725 MB
       offHeap: 60375 MB
   
   Task Reqs:
       cpus: 1.0
   
   spark.sql.shuffle.partitions: 200 (default)
   ```
   
   **Note**: Total memory is similar (~90GB for Spark, ~89GB for Gluten), but 
allocated differently between on-heap and off-heap.
   
   ## Stage Operators Comparison
   
   ### Vanilla Spark Operators (Successful)
   ```
   (119) ReusedExchange [Reuses operator id: 14]
   (120) ShuffleQueryStage
   (121) Sort [codegen id : 33]
         Arguments: [transid#1096 ASC NULLS FIRST, dt#1191 DESC NULLS LAST, 
txn_date#62 DESC NULLS LAST, ts#1177 DESC NULLS LAST]
   (122) Window
         Arguments: [row_number() windowspecdefinition(transid#1096, dt#1191 
DESC NULLS LAST, txn_date#62 DESC NULLS LAST, ts#1177 DESC NULLS LAST, 
specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS 
rank#176]
   (123) Filter [codegen id : 34]
         Condition: (rank#176 = 1)
   (124) Project [codegen id : 34]
   (125) Exchange
   ```
   
   ### Gluten Operators (Failed)
   
   ```
   (189) ShuffleQueryStage
   (190) InputAdapter
   (191) InputIteratorTransformer
   (192) SortExecTransformer
         Arguments: [transid#1096 ASC NULLS FIRST, dt#1191 DESC NULLS LAST, 
txn_date#62 DESC NULLS LAST, ts#1177 DESC NULLS LAST]
   (193) WindowExecTransformer
         Arguments: [row_number() windowspecdefinition(transid#1096, dt#1191 
DESC NULLS LAST, txn_date#62 DESC NULLS LAST, ts#1177 DESC NULLS LAST, 
specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS 
rank#176]
   (194) FilterExecTransformer
   (195) ProjectExecTransformer
   (196) WholeStageCodegenTransformer (37)
   (197) VeloxResizeBatches
   (198) ColumnarExchange
   ```
   
   ## Metrics Comparison
   
   ### Gluten - Sample Successful Tasks (before failure)
   
   | Index | Task ID | Duration | GC Time | Peak Execution Memory | Shuffle 
Write Size/Records | Shuffle Read Size/Records | Spill (Memory) | Spill (Disk) 
| Shuffle Remote Reads |
   
|-------|---------|----------|---------|----------------------|---------------------------|--------------------------|----------------|--------------|---------------------|
   | 5 | 345199 | 4.9 min | 2 s | 3.2 GiB | 366.1 MiB / 20,684,642 | 703.5 MiB 
/ 39,438,747 | 62.8 GiB | 1.2 GiB | 670.2 MiB |
   | 6 | 345201 | 3.8 min | 3 s | 3.2 GiB | 366 MiB / 20,681,664 | 703.4 MiB / 
39,431,462 | 62.8 GiB | 1.2 GiB | 662 MiB |
   | 7 | 345203 | 5.3 min | 4 s | 3.2 GiB | 366 MiB / 20,676,880 | 703.2 MiB / 
39,421,370 | 62.8 GiB | 1.2 GiB | 677.9 MiB |
   
   ### Gluten - Summary Metrics for 47 Completed Tasks
   
   | Metric | Min | 25th percentile | Median | 75th percentile | Max |
   |--------|-----|-----------------|--------|-----------------|-----|
   | Duration | 3.5 min | 5.2 min | 6.0 min | 6.6 min | 8.0 min |
   | GC Time | 0.1 s | 1 s | 3 s | 4 s | 4 s |
   | Peak Execution Memory | 2.6 GiB | 2.6 GiB | 3.2 GiB | 3.2 GiB | 3.3 GiB |
   | **Spill (Memory)** | **62.7 GiB** | **62.8 GiB** | **62.8 GiB** | **62.9 
GiB** | **62.9 GiB** |
   | Spill (Disk) | 1.2 GiB | 1.2 GiB | 1.2 GiB | 1.2 GiB | 1.2 GiB |
   | Shuffle Read Size/Records | 703.2 MiB / 39,421,370 | 703.4 MiB / 
39,432,615 | 703.5 MiB / 39,437,423 | 703.6 MiB / 39,443,119 | 703.9 MiB / 
39,460,186 |
   | Shuffle Write Size/Records | 366 MiB / 20,676,880 | 366 MiB / 20,682,012 | 
366.1 MiB / 20,684,419 | 366.1 MiB / 20,686,668 | 366.3 MiB / 20,695,719 |
   
   ### Vanilla Spark - Sample Successful Tasks
   
   | Index | Task ID | Duration | GC Time | Peak Execution Memory | Shuffle 
Write Size/Records | Shuffle Read Size/Records | Spill (Memory) | Spill (Disk) 
| Shuffle Remote Reads |
   
|-------|---------|----------|---------|----------------------|---------------------------|--------------------------|----------------|--------------|---------------------|
   | 5 | 341338 | 3.6 min | 3 s | 4 GiB | 596.8 MiB / 20,555,139 | 1.2 GiB / 
39,533,494 | 5.5 GiB | 1.1 GiB | 1.1 GiB |
   | 6 | 341340 | 3.8 min | 7 s | 3.6 GiB | 596.7 MiB / 20,552,224 | 1.2 GiB / 
39,525,019 | 5.8 GiB | 1.2 GiB | 1.1 GiB |
   | 7 | 341342 | 3.4 min | 6 s | 3.7 GiB | 596.6 MiB / 20,547,334 | 1.2 GiB / 
39,515,478 | 2.9 GiB | 615.7 MiB | 1.1 GiB |
   
   ### Vanilla Spark - Summary Metrics for 200 Completed Tasks
   
   | Metric | Min | 25th percentile | Median | 75th percentile | Max |
   |--------|-----|-----------------|--------|-----------------|-----|
   | Duration | 3.0 min | 3.2 min | 3.3 min | 3.5 min | 4.4 min |
   | GC Time | 2 s | 3 s | 5 s | 7 s | 9 s |
   | Peak Execution Memory | 3 GiB | 3.4 GiB | 3.6 GiB | 4 GiB | 6.8 GiB |
   | **Spill (Memory)** | **2.8 GiB** | **3.4 GiB** | **3.7 GiB** | **5.2 GiB** 
| **6.1 GiB** |
   | Spill (Disk) | 594.8 MiB | 694.2 MiB | 747.8 MiB | 1.1 GiB | 1.2 GiB |
   | Shuffle Read Size/Records | 1.2 GiB / 39,511,291 | 1.2 GiB / 39,526,316 | 
1.2 GiB / 39,532,177 | 1.2 GiB / 39,538,317 | 1.2 GiB / 39,553,345 |
   | Shuffle Write Size/Records | 596.5 MiB / 20,543,879 | 596.7 MiB / 
20,552,224 | 596.8 MiB / 20,555,023 | 596.9 MiB / 20,558,078 | 597.1 MiB / 
20,566,005 |
   
   
   
   ## Additional Context
   
   The workload involves a window function with `row_number()` partitioned by 
`transid` and ordered by multiple columns (`dt DESC, txn_date DESC, ts DESC`), 
followed by filtering for rank = 1 (deduplication pattern).
   
   Any insights into why Gluten exhibits such different spilling behavior would 
be greatly appreciated.
   
   
   ## Gluten
   
   <img width="611" height="1826" alt="Image" 
src="https://github.com/user-attachments/assets/6e032494-2706-4ec4-a603-25ce2ea8c056";
 />
   
   ```
   (119) ReusedExchange [Reuses operator id: 14]
   Output [8]: [transid#1096, txn_date#62, trans_type#1101, trans_status#1111, 
cc_id#1117, realtime_processor#63, dt#1191, ts#1177]
   
   (120) ShuffleQueryStage
   Output [8]: [transid#1096, txn_date#62, trans_type#1101, trans_status#1111, 
cc_id#1117, realtime_processor#63, dt#1191, ts#1177]
   Arguments: 10
   
   (121) Sort [codegen id : 33]
   Input [8]: [transid#1096, txn_date#62, trans_type#1101, trans_status#1111, 
cc_id#1117, realtime_processor#63, dt#1191, ts#1177]
   Arguments: [transid#1096 ASC NULLS FIRST, dt#1191 DESC NULLS LAST, 
txn_date#62 DESC NULLS LAST, ts#1177 DESC NULLS LAST], false, 0
   
   (122) Window
   Input [8]: [transid#1096, txn_date#62, trans_type#1101, trans_status#1111, 
cc_id#1117, realtime_processor#63, dt#1191, ts#1177]
   Arguments: [row_number() windowspecdefinition(transid#1096, dt#1191 DESC 
NULLS LAST, txn_date#62 DESC NULLS LAST, ts#1177 DESC NULLS LAST, 
specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS 
rank#176], [transid#1096], [dt#1191 DESC NULLS LAST, txn_date#62 DESC NULLS 
LAST, ts#1177 DESC NULLS LAST]
   
   (123) Filter [codegen id : 34]
   Input [9]: [transid#1096, txn_date#62, trans_type#1101, trans_status#1111, 
cc_id#1117, realtime_processor#63, dt#1191, ts#1177, rank#176]
   Condition : (rank#176 = 1)
   
   (124) Project [codegen id : 34]
   Output [6]: [transid#1096, txn_date#62, trans_type#1101, trans_status#1111, 
cc_id#1117, realtime_processor#63]
   Input [9]: [transid#1096, txn_date#62, trans_type#1101, trans_status#1111, 
cc_id#1117, realtime_processor#63, dt#1191, ts#1177, rank#176]
   
   (125) Exchange
   Input [6]: [transid#1096, txn_date#62, trans_type#1101, trans_status#1111, 
cc_id#1117, realtime_processor#63]
   Arguments: hashpartitioning(cc_id#1117, 200), ENSURE_REQUIREMENTS, 
[plan_id=18453]
   ```
   
   
   
   ## Vanilla Spark
   
   <img width="502" height="1824" alt="Image" 
src="https://github.com/user-attachments/assets/b5af10ab-aace-4850-9107-12c58ccf0bdd";
 />
   
   
   ```
   (189) ShuffleQueryStage
   Output [8]: [transid#1096, txn_date#62, trans_type#1101, trans_status#1111, 
cc_id#1117, realtime_processor#63, dt#1191, ts#1177]
   Arguments: 11
   
   (190) InputAdapter
   Input [8]: [transid#1096, txn_date#62, trans_type#1101, trans_status#1111, 
cc_id#1117, realtime_processor#63, dt#1191, ts#1177]
   
   (191) InputIteratorTransformer
   Input [8]: [transid#1096, txn_date#62, trans_type#1101, trans_status#1111, 
cc_id#1117, realtime_processor#63, dt#1191, ts#1177]
   
   (192) SortExecTransformer
   Input [8]: [transid#1096, txn_date#62, trans_type#1101, trans_status#1111, 
cc_id#1117, realtime_processor#63, dt#1191, ts#1177]
   Arguments: [transid#1096 ASC NULLS FIRST, dt#1191 DESC NULLS LAST, 
txn_date#62 DESC NULLS LAST, ts#1177 DESC NULLS LAST], false, 0
   
   (193) WindowExecTransformer
   Input [8]: [transid#1096, txn_date#62, trans_type#1101, trans_status#1111, 
cc_id#1117, realtime_processor#63, dt#1191, ts#1177]
   Arguments: [row_number() windowspecdefinition(transid#1096, dt#1191 DESC 
NULLS LAST, txn_date#62 DESC NULLS LAST, ts#1177 DESC NULLS LAST, 
specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS 
rank#176], [transid#1096], [dt#1191 DESC NULLS LAST, txn_date#62 DESC NULLS 
LAST, ts#1177 DESC NULLS LAST]
   
   (194) FilterExecTransformer
   Input [9]: [transid#1096, txn_date#62, trans_type#1101, trans_status#1111, 
cc_id#1117, realtime_processor#63, dt#1191, ts#1177, rank#176]
   Arguments: (rank#176 = 1)
   
   (195) ProjectExecTransformer
   Output [7]: [hash(cc_id#1117, 42) AS hash_partition_key#3916, transid#1096, 
txn_date#62, trans_type#1101, trans_status#1111, cc_id#1117, 
realtime_processor#63]
   Input [9]: [transid#1096, txn_date#62, trans_type#1101, trans_status#1111, 
cc_id#1117, realtime_processor#63, dt#1191, ts#1177, rank#176]
   
   (196) WholeStageCodegenTransformer (37)
   Input [7]: [hash_partition_key#3916, transid#1096, txn_date#62, 
trans_type#1101, trans_status#1111, cc_id#1117, realtime_processor#63]
   Arguments: false
   
   (197) VeloxResizeBatches
   Input [7]: [hash_partition_key#3916, transid#1096, txn_date#62, 
trans_type#1101, trans_status#1111, cc_id#1117, realtime_processor#63]
   Arguments: 1024, 2147483647
   
   (198) ColumnarExchange
   Input [7]: [hash_partition_key#3916, transid#1096, txn_date#62, 
trans_type#1101, trans_status#1111, cc_id#1117, realtime_processor#63]
   Arguments: hashpartitioning(cc_id#1117, 200), ENSURE_REQUIREMENTS, 
[transid#1096, txn_date#62, trans_type#1101, trans_status#1111, cc_id#1117, 
realtime_processor#63], [plan_id=20642], [shuffle_writer_type=hash]
   ```
   
   
   ### Gluten version
   
   Gluten-1.3
   
   ### Spark version
   
   Spark-3.5.x
   
   ### Spark configurations
   
   _No response_
   
   ### System information
   
   _No response_
   
   ### Relevant logs
   
   ```bash
   
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to