Victsm commented on a change in pull request #29855:
URL: https://github.com/apache/spark/pull/29855#discussion_r499263579
##########
File path:
common/network-common/src/main/java/org/apache/spark/network/server/TransportRequestHandler.java
##########
@@ -181,6 +182,17 @@ public void onFailure(Throwable e) {
private void processStreamUpload(final UploadStream req) {
assert (req.body() == null);
try {
+ // Retain the original metadata buffer, since it will be used during the
invocation of
+ // this method. Will be released later.
+ req.meta.retain();
+ // Make a copy of the original metadata buffer. In benchmark, we noticed
that
+ // we cannot respond the original metadata buffer back to the client,
otherwise
+ // in cases where multiple concurrent shuffles are present, a wrong
metadata might
+ // be sent back to client. This is related to the eager release of the
metadata buffer,
+ // i.e., we always release the original buffer by the time the
invocation of this
+ // method ends, instead of by the time we respond it to the client. This
is necessary,
+ // otherwise we start seeing memory issues very quickly in benchmarks.
+ ByteBuffer meta = cloneBuffer(req.meta.nioByteBuffer());
Review comment:
Netty `ByteBuf` is reference counted, and will be deallocated once
`release` calls decreases the reference to 0.
Nio `ByteBuffer` is not, and it's Java reference based and will be
deallocated by GC.
My original attempt was to use retain & release to cover a longer time
period so the `ByteBuf` would live until the callback finishes processing the
stream.
However, during stress test where we stress a single shuffle service with
many concurrent block push requests, we found out that the shuffle service can
fail to allocate additional Netty `ByteBuf` if we delay the release of
`req.meta` buffer.
I can update this PR with that version of the code which does not clone the
`ByteBuf` into a `ByteBuffer` and also attach the error log we saw, in case
there is something missed on my end.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]