gaoyajun02 commented on PR #46934: URL: https://github.com/apache/spark/pull/46934#issuecomment-2176119908
Service nodes with disk issues (e.g. `No space left on device`, `Read-only file system`) produce a large number of logs stating `IOExceptions exceeded the threshold when merging shufflePush`, as well as the following log lines:

```
INFO application_xxx attempt 1 shuffle 0 shuffleMerge 0: finalize shuffle merge
WARN Application application_xxxx shuffleId 0 shuffleMergeId 0 reduceId 133 update to index/meta failed
```

The corresponding call path is `updateChunkInfo` -> `writeChunkTracker`, which encounters an IOException when serializing chunkTracker to disk.

Based on this information, it can be determined that during the finalize partition process, writing the serialized chunk to the metaFile failed but the truncate operation succeeded. As a result, the bitmap of the last chunk was deleted from the metaFile, while the mapTracker had already recorded the mapId once the block data processing completed.
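To make the resulting inconsistency concrete, here is a minimal toy model in Python. It is not Spark's implementation: `FlakyMetaFile`, `append_chunk`, `truncate_to_committed` and `simulate` are hypothetical names, and the one-byte-per-mapId encoding is an assumption made only for illustration. It shows how a failed meta write followed by a successful truncate can leave mapIds in the tracker that no persisted chunk bitmap covers.

```python
import errno


class FlakyMetaFile:
    """In-memory stand-in for a meta file whose writes can be made to fail."""

    def __init__(self):
        self.data = bytearray()
        self.committed_pos = 0    # end of the last fully written chunk bitmap
        self.fail_writes = False  # flip to True to simulate a broken disk

    def append_chunk(self, payload: bytes) -> None:
        if self.fail_writes:
            # Leave a partial record behind, then fail like a full disk would.
            self.data += payload[: len(payload) // 2]
            raise OSError(errno.ENOSPC, "No space left on device")
        self.data += payload
        self.committed_pos = len(self.data)

    def truncate_to_committed(self) -> None:
        # Drop everything after the last fully written chunk bitmap.
        del self.data[self.committed_pos:]


def simulate():
    meta = FlakyMetaFile()
    map_tracker = set()

    # First chunk: blocks from maps 1 and 2 are merged and the bitmap is persisted.
    map_tracker.update({1, 2})
    meta.append_chunk(bytes([1, 2]))

    # Blocks from maps 3 and 4 are fully processed, so the tracker records them.
    map_tracker.update({3, 4})

    # The disk fails before the last chunk's bitmap is written during finalize.
    meta.fail_writes = True
    try:
        meta.append_chunk(bytes([3, 4]))
    except OSError:
        pass  # the toy finalize step only logs the failure and carries on

    # The truncate still succeeds and removes the partial last record.
    meta.truncate_to_committed()

    persisted = set(meta.data)  # mapIds covered by persisted chunk bitmaps
    return map_tracker, persisted


if __name__ == "__main__":
    tracker, persisted = simulate()
    print(sorted(tracker - persisted))  # prints [3, 4]
```

In this sketch the tracker reports maps 3 and 4 as merged, but no chunk bitmap in the meta data accounts for them, which mirrors the mismatch described above.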
