littlexyw opened a new pull request, #3224:
URL: https://github.com/apache/celeborn/pull/3224

   …lushBuffer
   
   
   ### What changes were proposed in this pull request?
   Remove the redundant release of data after an OutOfDirectMemoryError is 
thrown in flushBuffer.addComponent.
   
   
   
   ### Why are the changes needed?
   An OutOfDirectMemoryError can be thrown in flushBuffer.addComponent because, 
after a new component is added, CompositeByteBuf checks whether the number of 
components exceeds the maximum limit. If it does, the existing components are 
merged into one large component, which requires allocating new off-heap memory. 
If there is not enough memory at that point, OutOfDirectMemoryError is thrown, 
but the new component has already been added to flushBuffer. Releasing the new 
component again at that point causes a refcnt error.
   Leaving the component unreleased here does not cause a memory leak, because 
it is released normally in returnBuffer (on flush, file destroy, or file 
close).
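
   The ownership problem described above can be illustrated with a toy model. This is only a sketch under stated assumptions: `Buf` and `Composite` are hypothetical stand-ins written for this example, not Netty's `ByteBuf` or `CompositeByteBuf`, and a plain `OutOfMemoryError` simulates the failed off-heap allocation during consolidation.

```java
import java.util.ArrayList;
import java.util.List;

final class RefCountSketch {
    /** Hypothetical reference-counted buffer; NOT Netty's ByteBuf. */
    static final class Buf {
        private int refCnt = 1;

        void release() {
            if (refCnt == 0) {
                throw new IllegalStateException("refCnt: 0, decrement: 1");
            }
            refCnt--;
        }
    }

    /** Hypothetical composite that owns its components; NOT Netty's CompositeByteBuf. */
    static final class Composite {
        private final List<Buf> components = new ArrayList<>();
        private final int maxComponents;

        Composite(int maxComponents) {
            this.maxComponents = maxComponents;
        }

        void addComponent(Buf buf) {
            components.add(buf); // ownership is transferred here
            if (components.size() > maxComponents) {
                // Assumption: merging needs a fresh allocation, which fails in this model.
                throw new OutOfMemoryError("simulated direct memory exhaustion");
            }
        }

        /** Models returnBuffer: releases every component the composite owns. */
        void release() {
            for (Buf buf : components) {
                buf.release();
            }
            components.clear();
        }
    }

    /** Returns true if the final release hits a component that was already released. */
    static boolean doubleReleaseDetected(boolean releaseInCatch) {
        Composite flushBuffer = new Composite(1);
        flushBuffer.addComponent(new Buf()); // within the limit
        Buf data = new Buf();
        try {
            flushBuffer.addComponent(data); // exceeds the limit, merge fails
        } catch (OutOfMemoryError e) {
            if (releaseInCatch) {
                data.release(); // the redundant release this PR removes
            }
        }
        try {
            flushBuffer.release();
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(doubleReleaseDetected(true));
        System.out.println(doubleReleaseDetected(false));
    }
}
```

   In this model, releasing `data` in the catch block makes the later release of the composite fail, while skipping it leaves every component released exactly once.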
   If writeLocalData does not catch OutOfDirectMemoryError, the impact is as 
follows:
   1. With a single replica, if https://github.com/apache/celeborn/pull/3049 
is not merged, commitfile blocks in waitPendingWrites and fails, because 
writeLocalData does not correctly call decrementPendingWrites. However, this 
does not keep flushBuffer in memory for a long time: when the shuffle expires, 
the file is destroyed, flushBuffer is returned, and this memory is released.
   2. With two replicas, in addition to the problem above, the thread of the 
EventLoop that replicate-client belongs to blocks at 
Await.result(writePromise.future, Duration.Inf) because writePromise is never 
completed. As a result, this thread stops processing other PushData written by 
PushServer to the channels of that EventLoop. This data accumulates in the 
taskQueue of the EventLoop and cannot be canceled, which is the cause of the 
memory leak.
   Therefore, to fix the memory leak that follows an OutOfDirectMemoryError in 
flushBuffer.addComponent, it is enough to catch OutOfDirectMemoryError in 
writeLocalData; there is no need to release data after addComponent.
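
   A minimal sketch of the fix described above, under simplifying assumptions: `Writer` and its `write` method are hypothetical, a `CompletableFuture` stands in for writePromise, and a plain `OutOfMemoryError` stands in for OutOfDirectMemoryError. Where the real code increments and decrements pending writes is not modeled here; the point is only that the error path still decrements the counter and completes the promise, without releasing data.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;

final class WriteLocalDataSketch {
    /** Hypothetical writer that tracks pending writes; not Celeborn's file writer. */
    static final class Writer {
        final AtomicInteger pendingWrites = new AtomicInteger();
        private final boolean failWithOom;

        Writer(boolean failWithOom) {
            this.failWithOom = failWithOom;
        }

        void incrementPendingWrites() {
            pendingWrites.incrementAndGet();
        }

        void decrementPendingWrites() {
            pendingWrites.decrementAndGet();
        }

        /** Stands in for the write path that ends in flushBuffer.addComponent. */
        void write() {
            if (failWithOom) {
                throw new OutOfMemoryError("simulated direct memory exhaustion");
            }
        }
    }

    /** Returns a future that is always completed, even when the write fails. */
    static CompletableFuture<Void> writeLocalData(Writer writer) {
        CompletableFuture<Void> writePromise = new CompletableFuture<>();
        writer.incrementPendingWrites();
        try {
            writer.write();
            writePromise.complete(null);
        } catch (OutOfMemoryError e) {
            // No release of data here: the buffer already owns the component.
            writePromise.completeExceptionally(e);
        } finally {
            // Simplification: the counter is decremented on every path.
            writer.decrementPendingWrites();
        }
        return writePromise;
    }

    public static void main(String[] args) {
        Writer writer = new Writer(true);
        CompletableFuture<Void> promise = writeLocalData(writer);
        System.out.println(promise.isCompletedExceptionally());
        System.out.println(writer.pendingWrites.get());
    }
}
```

   Because the promise is completed on the failure path, a caller waiting on it indefinitely would be woken with the error instead of blocking, and a waiter on the pending-write counter would see it return to zero.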
   
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   
   ### How was this patch tested?
   Manual test.
   

