zwangsheng commented on PR #2536: URL: https://github.com/apache/celeborn/pull/2536#issuecomment-2144174749
Well, it seems I misjudged the issue. First, this NPE occurs when a Celeborn worker is gracefully shut down and restarted while Spark applications are running. After the restart, the dashboard shows an incorrect number of running applications for that worker, and tasks on the Spark side fail with an RPC failure caused by an NPE. (I now realize that this error does not seem to have anything to do with this PR; it is thrown by the code linked below.)

https://github.com/apache/celeborn/blob/e5f09ce4e06154c3cadd1ff13a9f304b672e9cdb/worker/src/main/java/org/apache/celeborn/service/deploy/worker/storage/ChunkStreamManager.java#L197-L200

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
