This is an automated email from the ASF dual-hosted git repository.
zuston pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-uniffle.git
The following commit(s) were added to refs/heads/master by this push:
new e0434791a [#1459] fix(server): Memory leak for exceptional scenarios
when flushing events (#1537)
e0434791a is described below
commit e0434791a9ff206327bbdaab0d5081ffe28da429
Author: RickyMa <[email protected]>
AuthorDate: Sat Feb 24 16:30:31 2024 +0800
[#1459] fix(server): Memory leak for exceptional scenarios when flushing
events (#1537)
### What changes were proposed in this pull request?
Memory may leak in exceptional scenarios when flushing events
### Why are the changes needed?
A follow-up PR for: https://github.com/apache/incubator-uniffle/pull/1459
I found out that memory may leak in exceptional scenarios when flushing
events.
<img width="560" alt="企业微信截图_17086201359387"
src="https://github.com/apache/incubator-uniffle/assets/13834479/e7b77e82-4336-4123-b59a-621346edd613">
<img width="811" alt="企业微信截图_1708620172540"
src="https://github.com/apache/incubator-uniffle/assets/13834479/931eeb38-0c1b-4cfb-83be-485b74c10e17">
Because the following code snippets have not been executed:
```
event.addCleanupCallback(
() -> {
this.clearInFlushBuffer(event.getEventId());
spBlocks.forEach(spb -> spb.getData().release());
}
);
```
and
```
event.addCleanupCallback(() -> releaseMemory(event.getSize(), true, false));
```
It can also lead to heap memory not being released, as
`HybridStorageManager`.`eventOfUnderStorageManagers` will hold a large amount
of unreleased heap memory. This is because the code
`event.addCleanupCallback(() -> eventOfUnderStorageManagers.invalidate(event))`
has not been executed.
<img width="986" alt="企业微信截图_17086622982256"
src="https://github.com/apache/incubator-uniffle/assets/13834479/98721112-8805-4f40-83ef-8daa463b6547">
<img width="454" alt="企业微信截图_17086635217020"
src="https://github.com/apache/incubator-uniffle/assets/13834479/e54b9e77-33de-44dd-b727-55c41bdfc95f">
After this PR, the memory will not leak and the exception's stack trace
will be something like:
[2024-02-23 14:25:02.614] [LocalFileFlushEventThreadPool-54] [ERROR]
DefaultFlushEventHandler.handleEventAndUpdateMetrics - Unexpected exceptions
happened due to
java.lang.NullPointerException
at
org.apache.uniffle.server.ShuffleTaskInfo.getMaxConcurrencyPerPartitionToWrite(ShuffleTaskInfo.java:109)
at
org.apache.uniffle.server.ShuffleFlushManager.getMaxConcurrencyPerPartitionWrite(ShuffleFlushManager.java:198)
at
org.apache.uniffle.server.ShuffleFlushManager.processFlushEvent(ShuffleFlushManager.java:149)
at
org.apache.uniffle.server.DefaultFlushEventHandler.handleEventAndUpdateMetrics(DefaultFlushEventHandler.java:87)
at
org.apache.uniffle.server.DefaultFlushEventHandler.lambda$dispatchEvent$0(DefaultFlushEventHandler.java:203)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
We can talk about the above exception in another issue after this PR is
merged. This PR is focused on fixing potential memory leaks.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing UTs.
---------
Co-authored-by: Enrico Minack <[email protected]>
---
.../java/org/apache/uniffle/server/DefaultFlushEventHandler.java | 7 +++++++
1 file changed, 7 insertions(+)
diff --git
a/server/src/main/java/org/apache/uniffle/server/DefaultFlushEventHandler.java
b/server/src/main/java/org/apache/uniffle/server/DefaultFlushEventHandler.java
index 1d4f43ca5..c5b320091 100644
---
a/server/src/main/java/org/apache/uniffle/server/DefaultFlushEventHandler.java
+++
b/server/src/main/java/org/apache/uniffle/server/DefaultFlushEventHandler.java
@@ -124,8 +124,15 @@ public class DefaultFlushEventHandler implements
FlushEventHandler {
}
if (e instanceof EventInvalidException) {
+ // Invalid events have already been released / cleaned up
+ // so no need to call event.doCleanup() here
return;
}
+
+ LOG.error(
+ "Unexpected exceptions happened when handling the flush event: {},
due to ", event, e);
+ // We need to release the memory when unexpected exceptions happened
+ event.doCleanup();
} finally {
if (storage != null) {
if (storage instanceof HadoopStorage) {