Lobo2008 opened a new issue, #2606:
URL: https://github.com/apache/uniffle/issues/2606

   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of 
Conduct](https://www.apache.org/foundation/policies/conduct)
   
   
   ### Search before asking
   
   - [x] I have searched in the 
[issues](https://github.com/apache/incubator-uniffle/issues?q=is%3Aissue) and 
found no similar issues.
   
   
   ### Describe the bug
   
   ### What is the problem?
   
   When using Uniffle with MapReduce jobs that have a Combiner, jobs often fail 
to complete. Analysis shows this is caused by severe GC overhead introduced by 
the map-stage combiner feature (#1301).
   
   ### Root Cause Analysis
   
   1.  **Architectural Mismatch**: The current implementation inherits the 
combiner logic from Hadoop's MapOutputBuffer, but does not account for the 
fundamental difference between disk-based shuffle (Hadoop) and remote shuffle 
(Uniffle).
   2.  **Execution Granularity**: In Hadoop, the combiner runs per spill 
(default ~80 MB). In Uniffle, it runs on the entire send buffer, whose size is 
`mapreduce.task.io.sort.mb` × `mapreduce.rss.client.sort.memory.use.threshold` 
(see the worked example after this list). Even when this buffer is large 
(e.g., 512 MB), jobs succeed with native ESS or with the combiner disabled; 
enabling the map-stage combiner is the trigger for the severe GC overhead and 
job hangs, indicating the problem is intrinsic to the combiner logic, not to 
the buffer size itself.
   3.  **GC Storm**: Processing hundreds of MBs of data in a single combiner 
run overwhelms the JVM heap, causing prolonged Full GC pauses.
   4.  **Deadlock Chain**: 
       - GC pauses stall the sender threads.
       - Stalled senders cannot free memory (`inSendListBytes` stays high).
       - The MapTask's `collect()` thread hits the memory limit and blocks 
(the `memoryUsedSize XXX is more than YYY` warning).
       - The result is a deadlock: the MapTask waits for memory, while the 
sender threads wait for the GC/combiner.
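
   As a worked illustration of the granularity gap in point 2 (the config 
values here are assumptions; only the `483183808` limit comes from the logs 
below), with `mapreduce.task.io.sort.mb=512` and 
`mapreduce.rss.client.sort.memory.use.threshold=0.9`:

   ```
   512 MiB × 0.9 = 536,870,912 B × 0.9 ≈ 483,183,821 B
   ```

   This matches the `memoryUsedSize ... is more than 483183808` limit in the 
WARN logs below (modulo alignment), meaning a single combiner run must process 
roughly 460 MB at once, versus ~80 MB per spill in vanilla Hadoop.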
   
   
   
   ### Evidence
   
   - **Log Evidence**: Constant WARN logs: `2025-09-10 19:28:58,789 WARN 
[main] org.apache.hadoop.mapred.SortWriteBufferManager: memoryUsedSize 
483183871 is more than 483183808, inSendListBytes 96636834`
   - **Code Evidence**: The combiner is unconditionally executed in 
`SortWriteBufferManager.prepareBufferForSend()`.
   - **Industry Practice**: Apache Celeborn, another major RSS, explicitly 
**does NOT** implement a map-stage combiner in its client 
(`CelebornMapOutputCollector`), focusing solely on efficient data pushing.
   
   ### Expected Behavior
   
   The map-stage combiner should be an **opt-in** feature for expert users, not 
an **opt-out** one that destabilizes jobs by default. Most users should get a 
stable, reliable job run out of the box.
   
   ### How to reproduce?
   
   Run any MapReduce job with a non-trivial Combiner and significant data 
volume on Uniffle. The job will likely hang or progress extremely slowly, with 
GC logs showing 90%+ GC overhead.
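
   A minimal driver sketch is below. It is illustrative, not a definitive 
setup: it borrows the stock WordCount mapper/combiner from 
`hadoop-mapreduce-examples`, and the only Uniffle-specific setting shown is 
the collector class, whose name is taken from the stack trace below; the rest 
of the RSS client wiring (coordinator address, etc.) is deployment-specific 
and omitted.

   ```java
   // Illustrative repro sketch: assumes a working Uniffle-enabled MR client.
   import org.apache.hadoop.conf.Configuration;
   import org.apache.hadoop.examples.WordCount;
   import org.apache.hadoop.fs.Path;
   import org.apache.hadoop.io.IntWritable;
   import org.apache.hadoop.io.Text;
   import org.apache.hadoop.mapreduce.Job;
   import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
   import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

   public class CombinerRepro {
     public static void main(String[] args) throws Exception {
       Configuration conf = new Configuration();
       // Route map output through the Uniffle collector (class name from the
       // stack trace below); other RSS client settings are deployment-specific.
       conf.set("mapreduce.job.map.output.collector.class",
           "org.apache.hadoop.mapred.RssMapOutputCollector");
       conf.setInt("mapreduce.task.io.sort.mb", 512); // large send buffer

       Job job = Job.getInstance(conf, "uniffle-combiner-repro");
       job.setJarByClass(CombinerRepro.class);
       job.setMapperClass(WordCount.TokenizerMapper.class);
       job.setCombinerClass(WordCount.IntSumReducer.class); // non-trivial combiner
       job.setReducerClass(WordCount.IntSumReducer.class);
       job.setOutputKeyClass(Text.class);
       job.setOutputValueClass(IntWritable.class);
       FileInputFormat.addInputPath(job, new Path(args[0]));  // large input
       FileOutputFormat.setOutputPath(job, new Path(args[1]));
       System.exit(job.waitForCompletion(true) ? 0 : 1);
     }
   }
   ```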
   
   ### Proposed Solution
   
   Introduce a configuration parameter (e.g., 
`mapreduce.rss.client.combiner.enable`) to control this feature. **The default 
value must be `false`** to ensure stability. A minimal sketch follows the list 
below.
   
   This solution is:
   - **Simple and safe**: A minimal code change.
   - **User-friendly**: Provides stability by default, flexibility for experts.
   - **Aligned with RSS philosophy**: Offloads heavy computation from compute 
nodes.
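
   A minimal sketch of the proposed guard (the property name is the one 
proposed above; the helper class and its placement inside 
`SortWriteBufferManager` are hypothetical, not the actual Uniffle code):

   ```java
   import org.apache.hadoop.conf.Configuration;

   /** Sketch only: not actual Uniffle code. */
   public final class RssCombinerConfig {
     // Proposed property; default false keeps the combiner off for stability.
     public static final String COMBINER_ENABLE =
         "mapreduce.rss.client.combiner.enable";
     public static final boolean COMBINER_ENABLE_DEFAULT = false;

     private RssCombinerConfig() {}

     /** Whether the map-stage combiner should run in the Uniffle MR client. */
     public static boolean combinerEnabled(Configuration conf) {
       return conf.getBoolean(COMBINER_ENABLE, COMBINER_ENABLE_DEFAULT);
     }
   }
   ```

   Inside `prepareBufferForSend()`, the existing combine step would then be 
wrapped in `if (RssCombinerConfig.combinerEnabled(conf)) { ... }`, so the 
default path simply pushes raw map output, matching Celeborn's behavior.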
   
   
   ### Affects Version(s)
   
   master
   
   ### Uniffle Server Log Output
   
   ```
   [2025-09-09 15:03:24.179] [checkResource-0] [WARN] ShuffleTaskManager - Remove expired preAllocatedBuffer[id=267241985] that required by app: appattempt_1755665307636_291096_000001
   ```
   
   ### Uniffle Engine Log Output
   
   ```
   2025-09-11 19:57:50,195 WARN [main] org.apache.hadoop.mapred.SortWriteBufferManager: memoryUsedSize 483183865 is more than 483183808, inSendListBytes 96636783
   2025-09-11 19:57:50,195 WARN [main] org.apache.hadoop.mapred.SortWriteBufferManager: memoryUsedSize 483183884 is more than 483183808, inSendListBytes 96637115
   2025-09-11 19:57:50,195 WARN [main] org.apache.hadoop.mapred.SortWriteBufferManager: memoryUsedSize 483183881 is more than 483183808, inSendListBytes 96637031
   
   
   
   Error: org.apache.uniffle.common.exception.RssException: Timeout: failed because 622182 blocks can't be sent to shuffle server in 600000 ms.
       at org.apache.hadoop.mapred.SortWriteBufferManager.waitSendFinished(SortWriteBufferManager.java:367)
       at org.apache.hadoop.mapred.RssMapOutputCollector.flush(RssMapOutputCollector.java:240)
       at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:755)
       at
   ```
   
   ### Uniffle Server Configurations
   
   ```yaml
   
   ```
   
   ### Uniffle Engine Configurations
   
   ```yaml
   
   ```
   
   ### Additional context
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [x] Yes I am willing to submit a PR!

