FMX commented on code in PR #2975:
URL: https://github.com/apache/celeborn/pull/2975#discussion_r1872483617
##########
worker/src/main/java/org/apache/celeborn/service/deploy/worker/storage/PartitionDataWriter.java:
##########
@@ -416,7 +416,13 @@ public void write(ByteBuf data) throws IOException {
}
data.retain();
- flushBuffer.addComponent(true, data);
+ try {
+ flushBuffer.addComponent(true, data);
+ } catch (OutOfMemoryError oom) {
Review Comment:
Catching the OOM here does not make sense. By the time the OOM happens,
the flush buffer already holds shuffle data that was received earlier.
That data is lost, even after the files are committed, which causes a
data correctness problem.
That's why we never try to resume a Celeborn worker after an OOM.
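
To illustrate the concern, here is a minimal fail-fast sketch. It is not Celeborn code: `FailFastWriterSketch`, `Appender`, and the `poisoned` flag are hypothetical names, a plain `List<byte[]>` stands in for the flush buffer, and the OOM is simulated by the caller. It only shows the general idea of rethrowing the error and refusing to commit afterwards, rather than swallowing it and continuing.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the actual PartitionDataWriter.
class FailFastWriterSketch {
    // Hypothetical hook so an allocation failure can be simulated.
    public interface Appender {
        void append(List<byte[]> buffer, byte[] chunk);
    }

    // Stands in for the flush buffer that already holds earlier shuffle data.
    private final List<byte[]> pending = new ArrayList<>();
    private final Appender appender;
    private boolean poisoned = false;

    public FailFastWriterSketch(Appender appender) {
        this.appender = appender;
    }

    public void write(byte[] chunk) {
        ensureHealthy();
        try {
            appender.append(pending, chunk);
        } catch (OutOfMemoryError oom) {
            // Do not swallow the error: the buffered data can no longer be
            // trusted, so mark the writer as failed and rethrow.
            poisoned = true;
            throw oom;
        }
    }

    // Returns the number of chunks committed; refuses to commit after an OOM.
    public int commit() {
        ensureHealthy();
        int committed = pending.size();
        pending.clear();
        return committed;
    }

    private void ensureHealthy() {
        if (poisoned) {
            throw new IllegalStateException("writer hit an OOM earlier; refusing to continue");
        }
    }

    public static void main(String[] args) {
        FailFastWriterSketch writer =
            new FailFastWriterSketch((buffer, chunk) -> buffer.add(chunk));
        writer.write(new byte[] {1});
        writer.write(new byte[] {2});
        System.out.println("committed chunks: " + writer.commit());
    }
}
```

With this shape, a caller that hits the simulated OOM sees the error propagate, and a later `commit()` fails loudly instead of producing a file that is silently missing data.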