Lobo2008 opened a new issue, #2606: URL: https://github.com/apache/uniffle/issues/2606
### Code of Conduct

- [x] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)

### Search before asking

- [x] I have searched in the [issues](https://github.com/apache/incubator-uniffle/issues?q=is%3Aissue) and found no similar issues.

### Describe the bug

### What is the problem?

When using Uniffle with MapReduce jobs that have a Combiner, jobs often fail to complete. Analysis shows this is caused by severe GC overhead introduced by the map-stage combiner feature (#1301).

### Root Cause Analysis

1. **Architectural mismatch**: The current implementation inherits the combiner logic from Hadoop's `MapOutputBuffer`, but does not account for the fundamental difference between disk shuffle (Hadoop) and remote shuffle (Uniffle).
2. **Execution granularity**: In Hadoop, the combiner runs per spill (by default roughly 80 MB). In Uniffle, the combiner runs over the entire send buffer, whose size is `mapreduce.task.io.sort.mb × mapreduce.rss.client.sort.memory.use.threshold`; for example, a 512 MB sort buffer with a 0.9 threshold means each combiner run processes roughly 460 MB at once. Even with such a large buffer, jobs succeed with native ESS or when the combiner is disabled; enabling the map-stage combiner is what triggers the severe GC overhead and job hangs, indicating the problem is intrinsic to the combiner logic rather than the buffer size itself.
3. **GC storm**: Processing hundreds of MB of data in a single combiner run overwhelms the JVM heap, causing prolonged Full GC pauses.
4. **Deadlock chain**:
   - GC pauses stall the sender threads.
   - Stalled senders cannot free memory (`inSendListBytes` stays high).
   - The MapTask's `collect()` thread hits the memory limit and blocks (`memoryUsedSize XXX is more than YYY` warning).
   - This creates a deadlock: the MapTask waits for memory while the sender threads wait on GC and the combiner.

### Evidence

- **Log evidence**: constant WARN logs such as `2025-09-10 19:28:58,789 WARN [main] org.apache.hadoop.mapred.SortWriteBufferManager: memoryUsedSize 483183871 is more than 483183808, inSendListBytes 96636834`
- **Code evidence**: the combiner is executed unconditionally in `SortWriteBufferManager.prepareBufferForSend()`.
- **Industry practice**: Apache Celeborn, another major RSS, explicitly does **not** implement a map-stage combiner in its client (`CelebornMapOutputCollector`), focusing solely on efficient data pushing.

### Expected Behavior

The map-stage combiner should be an **opt-in** feature for expert users, not an **opt-out** one that causes instability by default. Most users should get a stable, reliable job run out of the box.

### How to reproduce?

Run any MapReduce job with a non-trivial Combiner and a significant data volume on Uniffle. The job will likely hang or progress extremely slowly, with GC logs showing 90%+ GC overhead.

### Proposed Solution

Introduce a configuration parameter (e.g., `mapreduce.rss.client.combiner.enable`) to control this feature. **The default value must be `false`** to ensure stability. A minimal sketch of the guard is shown below the list. This solution is:

- **Simple and safe**: a minimal code change.
- **User-friendly**: stability by default, flexibility for experts.
- **Aligned with the RSS philosophy**: offloads heavy computation from compute nodes.
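The following is a hedged sketch of what the opt-in guard could look like. The constant names, the default value, and the `SortWriteBuffer`/`runCombiner()`/`send()` helpers are illustrative assumptions rather than the actual Uniffle client API; only the Hadoop `Configuration.getBoolean` usage is standard.

```java
import org.apache.hadoop.conf.Configuration;

/**
 * Sketch of the proposed opt-in combiner guard for the MR client.
 * Names below are hypothetical and only illustrate the shape of the change.
 */
public class CombinerGuardSketch {

  // Proposed configuration key; defaults to false so jobs are stable out of the box.
  static final String RSS_CLIENT_COMBINER_ENABLE = "mapreduce.rss.client.combiner.enable";
  static final boolean RSS_CLIENT_COMBINER_ENABLE_DEFAULT = false;

  private final boolean combinerEnabled;

  CombinerGuardSketch(Configuration conf) {
    this.combinerEnabled =
        conf.getBoolean(RSS_CLIENT_COMBINER_ENABLE, RSS_CLIENT_COMBINER_ENABLE_DEFAULT);
  }

  /** Called where prepareBufferForSend() currently runs the combiner unconditionally. */
  void prepareBufferForSend(SortWriteBuffer buffer) {
    if (combinerEnabled && buffer.hasCombiner()) {
      // Expert users opt in and accept the extra CPU/GC cost on the map side.
      buffer.runCombiner();
    }
    // Otherwise, push raw map output to the shuffle servers and let reducers combine.
    buffer.send();
  }

  /** Placeholder standing in for the real send-buffer abstraction. */
  interface SortWriteBuffer {
    boolean hasCombiner();
    void runCombiner();
    void send();
  }
}
```

With such a flag, expert users who know their combiner is cheap could re-enable the behavior per job, e.g. by passing `-Dmapreduce.rss.client.combiner.enable=true` (assuming the proposed property name), while everyone else gets the stable default.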
### Affects Version(s)

master

### Uniffle Server Log Output

```
[2025-09-09 15:03:24.179] [checkResource-0] [WARN] ShuffleTaskManager - Remove expired preAllocatedBuffer[id=267241985] that required by app: appattempt_1755665307636_291096_000001
```

### Uniffle Engine Log Output

```
2025-09-11 19:57:50,195 WARN [main] org.apache.hadoop.mapred.SortWriteBufferManager: memoryUsedSize 483183865 is more than 483183808, inSendListBytes 96636783
2025-09-11 19:57:50,195 WARN [main] org.apache.hadoop.mapred.SortWriteBufferManager: memoryUsedSize 483183884 is more than 483183808, inSendListBytes 96637115
2025-09-11 19:57:50,195 WARN [main] org.apache.hadoop.mapred.SortWriteBufferManager: memoryUsedSize 483183881 is more than 483183808, inSendListBytes 96637031
Error: org.apache.uniffle.common.exception.RssException: Timeout: failed because 622182 blocks can't be sent to shuffle server in 600000 ms.
	at org.apache.hadoop.mapred.SortWriteBufferManager.waitSendFinished(SortWriteBufferManager.java:367)
	at org.apache.hadoop.mapred.RssMapOutputCollector.flush(RssMapOutputCollector.java:240)
	at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:755)
	at
```

### Uniffle Server Configurations

```yaml
```

### Uniffle Engine Configurations

```yaml
```

### Additional context

_No response_

### Are you willing to submit PR?

- [x] Yes I am willing to submit a PR!