Lobo2008 opened a new issue, #2606: URL: https://github.com/apache/uniffle/issues/2606
### Code of Conduct

- [x] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)

### Search before asking

- [x] I have searched in the [issues](https://github.com/apache/incubator-uniffle/issues?q=is%3Aissue) and found no similar issues.

### Describe the bug

### What is the problem?

When using Uniffle with MapReduce jobs that have a Combiner, jobs often fail to complete. Analysis shows this is caused by severe GC overhead introduced by the map-stage combiner feature (#1301).

### Root Cause Analysis

1. **Architectural mismatch**: The current implementation inherits the combiner logic from Hadoop's `MapOutputBuffer`, but does not account for the fundamental difference between disk shuffle (Hadoop) and remote shuffle (Uniffle).
2. **Execution granularity**: In Hadoop, the combiner runs per spill (by default roughly 80 MB). In Uniffle, the combiner runs over the entire send buffer, whose size is `mapreduce.task.io.sort.mb × mapreduce.rss.client.sort.memory.use.threshold`; for example, a 512 MB sort buffer with a 0.9 threshold means each combiner run processes roughly 460 MB at once. Even with such a large buffer, jobs succeed with native ESS or when the combiner is disabled; enabling the map-stage combiner is what triggers the severe GC overhead and job hangs, indicating the problem is intrinsic to the combiner logic rather than the buffer size itself.
3. **GC storm**: Processing hundreds of MB of data in a single combiner run overwhelms the JVM heap, causing prolonged Full GC pauses.
4. **Deadlock chain**:
   - GC pauses stall the sender threads.
   - Stalled senders cannot free memory (`inSendListBytes` stays high).
   - The MapTask's `collect()` thread hits the memory limit and blocks (`memoryUsedSize XXX is more than YYY` warning).
   - This creates a deadlock: the MapTask waits for memory while the sender threads wait on GC and the combiner.

### Evidence

- **Log evidence**: constant WARN logs such as `2025-09-10 19:28:58,789 WARN [main] org.apache.hadoop.mapred.SortWriteBufferManager: memoryUsedSize 483183871 is more than 483183808, inSendListBytes 96636834`
- **Code evidence**: the combiner is executed unconditionally in `SortWriteBufferManager.prepareBufferForSend()`.
- **Industry practice**: Apache Celeborn, another major RSS, explicitly does **not** implement a map-stage combiner in its client (`CelebornMapOutputCollector`), focusing solely on efficient data pushing.

### Expected Behavior

The map-stage combiner should be an **opt-in** feature for expert users, not an **opt-out** one that causes instability by default. Most users should get a stable, reliable job run out of the box.

### How to reproduce?

Run any MapReduce job with a non-trivial Combiner and a significant data volume on Uniffle. The job will likely hang or progress extremely slowly, with GC logs showing 90%+ GC overhead.

### Proposed Solution

Introduce a configuration parameter (e.g., `mapreduce.rss.client.combiner.enable`) to control this feature. **The default value must be `false`** to ensure stability. A minimal sketch of the guard is shown below the list. This solution is:

- **Simple and safe**: a minimal code change.
- **User-friendly**: stability by default, flexibility for experts.
- **Aligned with the RSS philosophy**: offloads heavy computation from compute nodes.
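The following is a hedged sketch of what the opt-in guard could look like. The constant names, the default value, and the `SortWriteBuffer`/`runCombiner()`/`send()` helpers are illustrative assumptions rather than the actual Uniffle client API; only the Hadoop `Configuration.getBoolean` usage is standard.

```java
import org.apache.hadoop.conf.Configuration;

/**
 * Sketch of the proposed opt-in combiner guard for the MR client.
 * Names below are hypothetical and only illustrate the shape of the change.
 */
public class CombinerGuardSketch {

  // Proposed configuration key; defaults to false so jobs are stable out of the box.
  static final String RSS_CLIENT_COMBINER_ENABLE = "mapreduce.rss.client.combiner.enable";
  static final boolean RSS_CLIENT_COMBINER_ENABLE_DEFAULT = false;

  private final boolean combinerEnabled;

  CombinerGuardSketch(Configuration conf) {
    this.combinerEnabled =
        conf.getBoolean(RSS_CLIENT_COMBINER_ENABLE, RSS_CLIENT_COMBINER_ENABLE_DEFAULT);
  }

  /** Called where prepareBufferForSend() currently runs the combiner unconditionally. */
  void prepareBufferForSend(SortWriteBuffer buffer) {
    if (combinerEnabled && buffer.hasCombiner()) {
      // Expert users opt in and accept the extra CPU/GC cost on the map side.
      buffer.runCombiner();
    }
    // Otherwise, push raw map output to the shuffle servers and let reducers combine.
    buffer.send();
  }

  /** Placeholder standing in for the real send-buffer abstraction. */
  interface SortWriteBuffer {
    boolean hasCombiner();
    void runCombiner();
    void send();
  }
}
```

With such a flag, expert users who know their combiner is cheap could re-enable the behavior per job, e.g. by passing `-Dmapreduce.rss.client.combiner.enable=true` (assuming the proposed property name), while everyone else gets the stable default.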
### Affects Version(s)

master

### Uniffle Server Log Output

```
[2025-09-09 15:03:24.179] [checkResource-0] [WARN] ShuffleTaskManager - Remove expired preAllocatedBuffer[id=267241985] that required by app: appattempt_1755665307636_291096_000001
```

### Uniffle Engine Log Output

```
2025-09-11 19:57:50,195 WARN [main] org.apache.hadoop.mapred.SortWriteBufferManager: memoryUsedSize 483183865 is more than 483183808, inSendListBytes 96636783
2025-09-11 19:57:50,195 WARN [main] org.apache.hadoop.mapred.SortWriteBufferManager: memoryUsedSize 483183884 is more than 483183808, inSendListBytes 96637115
2025-09-11 19:57:50,195 WARN [main] org.apache.hadoop.mapred.SortWriteBufferManager: memoryUsedSize 483183881 is more than 483183808, inSendListBytes 96637031
Error: org.apache.uniffle.common.exception.RssException: Timeout: failed because 622182 blocks can't be sent to shuffle server in 600000 ms.
	at org.apache.hadoop.mapred.SortWriteBufferManager.waitSendFinished(SortWriteBufferManager.java:367)
	at org.apache.hadoop.mapred.RssMapOutputCollector.flush(RssMapOutputCollector.java:240)
	at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:755)
	at
```

### Uniffle Server Configurations

```yaml
```

### Uniffle Engine Configurations

```yaml
```

### Additional context

_No response_

### Are you willing to submit PR?

- [x] Yes I am willing to submit a PR!