jiang13021 commented on PR #3534: URL: https://github.com/apache/celeborn/pull/3534#issuecomment-3568756603
> Nice catch. We did have the same problem, but I don’t know what scenarios will cause data loss, have you found the root cause? @jiang13021

The data loss in this issue is caused by corrupted compressed data. We have not yet identified the root cause of the corruption; we only know that it happens occasionally. When the compressed body is corrupted, a stage rerun is triggered directly. This case, however, is the rarer one of header corruption. After this PR is merged, header corruption will also trigger a stage rerun.
