littlexyw opened a new pull request, #3224:
URL: https://github.com/apache/celeborn/pull/3224
…lushBuffer
### What changes were proposed in this pull request?
Remove the redundant release of data after an OutOfDirectMemoryError is thrown
from flushBuffer.addComponent.
### Why are the changes needed?
flushBuffer.addComponent can throw OutOfDirectMemoryError because, after adding
a new component, CompositeByteBuf checks whether the number of components
exceeds the maximum limit. If it does, the existing components are merged into
one large component, which requires a new off-heap memory allocation. If there
is not enough memory for that allocation, OutOfDirectMemoryError is thrown, but
by then the new component has already been added to flushBuffer. Releasing the
new component at that point causes a refcnt error.
Leaving the component unreleased here does not cause a memory leak, because it
is released normally in returnBuffer (on flush, file destroy, or file close).
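
The ordering problem described above can be illustrated with a small, self-contained simulation. This is only a sketch under stated assumptions: `ToyBuf` and `ToyComposite` are hypothetical stand-ins for Netty's `ByteBuf` and `CompositeByteBuf`, `OutOfMemoryError` stands in for `OutOfDirectMemoryError` (which is a subclass of it), and `releaseAll` plays the role of returnBuffer. It is not Celeborn or Netty code.

```java
import java.util.ArrayList;
import java.util.List;

final class AddComponentSketch {

    /** Hypothetical reference-counted buffer standing in for a Netty ByteBuf. */
    static final class ToyBuf {
        private int refCnt = 1;

        void release() {
            if (refCnt == 0) {
                // Netty would throw IllegalReferenceCountException here.
                throw new IllegalStateException("refCnt: 0, decrement: 1");
            }
            refCnt--;
        }
    }

    /** Hypothetical composite that owns a component before it consolidates. */
    static final class ToyComposite {
        private final List<ToyBuf> components = new ArrayList<>();
        private final int maxComponents;
        private final boolean memoryAvailable;

        ToyComposite(int maxComponents, boolean memoryAvailable) {
            this.maxComponents = maxComponents;
            this.memoryAvailable = memoryAvailable;
        }

        void addComponent(ToyBuf buf) {
            components.add(buf);                  // the component is owned from here on
            if (components.size() > maxComponents) {
                consolidate();                    // may fail AFTER the add
            }
        }

        private void consolidate() {
            if (!memoryAvailable) {
                throw new OutOfMemoryError("simulated OutOfDirectMemoryError");
            }
            // The actual merge into one large component is omitted.
        }

        /** Plays the role of returnBuffer: releases every owned component. */
        void releaseAll() {
            for (ToyBuf buf : components) {
                buf.release();
            }
            components.clear();
        }
    }

    static String demo(boolean releaseInCatch) {
        ToyComposite flushBuffer = new ToyComposite(1, false);
        flushBuffer.addComponent(new ToyBuf());   // below the limit, no merge
        ToyBuf data = new ToyBuf();
        try {
            flushBuffer.addComponent(data);       // exceeds the limit, merge fails
        } catch (OutOfMemoryError e) {
            if (releaseInCatch) {
                data.release();                   // the redundant release
            }
        }
        try {
            flushBuffer.releaseAll();             // releases data (again)
            return "ok";
        } catch (IllegalStateException e) {
            return "double release: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(demo(true));
        System.out.println(demo(false));
    }
}
```

With `demo(true)` the later `releaseAll` hits a buffer whose count is already zero; with `demo(false)` the single release in `releaseAll` is enough, which is the behavior this PR relies on.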
If writeLocalData does not catch OutOfDirectMemoryError, the impact is as
follows:
1. With a single replica, if https://github.com/apache/celeborn/pull/3049 is
not merged, commitfile blocks in waitPendingWrites and fails, because
writeLocalData does not call decrementPendingWrites correctly. However, this
does not keep flushBuffer in memory for a long time: when the shuffle expires,
the file is destroyed, flushBuffer is returned, and this memory is released.
2. With dual replicas, in addition to the problem above, the thread of the
EventLoop that replicate-client belongs to blocks at
Await.result(writePromise.future, Duration.Inf), because writePromise is not
completed correctly. As a result, this thread no longer processes other
PushData data that PushServer writes to the channels of that EventLoop. This
data accumulates in the taskQueue of the EventLoop and cannot be canceled,
which is the cause of the memory leak.
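
The dual-replica case can be sketched with plain `java.util.concurrent` types. This is an illustration only, assuming a single-threaded `ThreadPoolExecutor` as a stand-in for the Netty EventLoop and a `CompletableFuture` as a stand-in for writePromise; the class and method names are hypothetical.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

final class BlockedEventLoopSketch {

    /** Returns how many tasks pile up while the only loop thread is blocked. */
    static int queuedWhileBlocked() {
        // One thread with an unbounded queue, standing in for an EventLoop and its taskQueue.
        ThreadPoolExecutor eventLoop = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
        CompletableFuture<Void> writePromise = new CompletableFuture<>();
        CountDownLatch blocked = new CountDownLatch(1);
        try {
            eventLoop.execute(() -> {
                blocked.countDown();
                try {
                    // Analogous to Await.result(writePromise.future, Duration.Inf).
                    writePromise.join();
                } catch (CompletionException ignored) {
                    // The promise was failed; the loop thread is free again.
                }
            });
            blocked.await();

            // Later PushData work lands on the same loop and just waits in the queue.
            for (int i = 0; i < 3; i++) {
                eventLoop.execute(() -> { });
            }
            int queued = eventLoop.getQueue().size();

            // Completing (here: failing) the promise is what unblocks the thread.
            writePromise.completeExceptionally(new RuntimeException("write failed"));
            eventLoop.shutdown();
            eventLoop.awaitTermination(5, TimeUnit.SECONDS);
            return queued;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("queued while blocked: " + queuedWhileBlocked());
    }
}
```

In the sketch the promise is eventually failed so the program terminates; in the scenario above nothing ever completes it, so the queued work stays in memory.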
Therefore, to fix the memory leak that follows an OutOfDirectMemoryError in
flushBuffer.addComponent, it is enough to catch OutOfDirectMemoryError in
writeLocalData; there is no need to release data after addComponent.
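
The shape of that handling might look like the following minimal sketch. It is not the actual Celeborn implementation: `ToyWriter`, the `pendingWrites` counter, and the `CompletableFuture` used as writePromise are all hypothetical stand-ins, and `OutOfMemoryError` stands in for `OutOfDirectMemoryError`.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;

final class WriteLocalDataSketch {

    /** Hypothetical stand-in for the file writer; write may throw OutOfMemoryError. */
    interface ToyWriter {
        void write(byte[] data);
    }

    private final AtomicInteger pendingWrites = new AtomicInteger();

    int pendingWrites() {
        return pendingWrites.get();
    }

    CompletableFuture<Void> writeLocalData(ToyWriter writer, byte[] data) {
        CompletableFuture<Void> writePromise = new CompletableFuture<>();
        pendingWrites.incrementAndGet();
        try {
            writer.write(data);
            writePromise.complete(null);
        } catch (OutOfMemoryError | RuntimeException e) {
            // data is NOT released here: the flush buffer already owns it.
            writePromise.completeExceptionally(e);
        } finally {
            // Always balance the counter so waiters on pending writes can finish.
            pendingWrites.decrementAndGet();
        }
        return writePromise;
    }

    public static void main(String[] args) {
        WriteLocalDataSketch sketch = new WriteLocalDataSketch();
        CompletableFuture<Void> promise = sketch.writeLocalData(
                d -> { throw new OutOfMemoryError("simulated"); }, new byte[8]);
        System.out.println("failed: " + promise.isCompletedExceptionally());
        System.out.println("pending: " + sketch.pendingWrites());
    }
}
```

The point of the `finally` block and the failed promise is that both the pending-write counter and anything awaiting the promise are resolved even when the write throws.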
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manual test.