[GitHub] [spark] pan3793 edited a comment on pull request #35076: [SPARK-37793][CORE][SHUFFLE] Fallback to fetch original blocks when noLocalMergedBlockDataError

GitBox Tue, 04 Jan 2022 20:24:58 -0800


pan3793 edited a comment on pull request #35076:
URL: https://github.com/apache/spark/pull/35076#issuecomment-1005368068



   @otterc I agree with you that, by design, `bufs` should never be empty, and 
the same applies to #34934. 
   
   I also suspected there were bugs or concurrency issues in the code and added 
some assertions, but unfortunately they found nothing. 
   
   Besides those 2 issues, I also frequently hit shuffle data corruption:
   ```
   Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 144 in stage 1921.0 failed 4 times, most recent failure: Lost task 144.3 in stage 1921.0 (TID 139025) (beta-spark4 executor 85): java.io.EOFException: reached end of stream after reading 46 bytes; 48 bytes expected
        at org.sparkproject.guava.io.ByteStreams.readFully(ByteStreams.java:735)
        at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2$$anon$3.next(UnsafeRowSerializer.scala:127)
        at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2$$anon$3.next(UnsafeRowSerializer.scala:110)
           ...
   ```
   Both hardware (disk) issues and network issues can corrupt shuffle data, 
and because push-based shuffle lacks a checksum mechanism, there is a chance 
that corrupt data is passed to the `xxSerializer` layer, which then throws an 
exception and fails the task.
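
   To make the checksum idea concrete, here is a minimal sketch of verifying a block before it is handed to a deserializer. It is not Spark code: the class `BlockChecksumSketch`, its methods, and the choice of CRC32 plus an explicit length check are all assumptions made for illustration only.

```java
import java.util.Arrays;
import java.util.zip.CRC32;

// Hypothetical sketch, not Spark's implementation: the writer records the
// length and CRC32 of a block, and the reader verifies both before the bytes
// reach any deserializer.
public class BlockChecksumSketch {

    // Compute a CRC32 checksum over the whole block.
    public static long checksumOf(byte[] block) {
        CRC32 crc = new CRC32();
        crc.update(block, 0, block.length);
        return crc.getValue();
    }

    // A block is accepted only if both its length and its checksum match
    // what the writer recorded.
    public static boolean isIntact(byte[] received, int expectedLength, long expectedChecksum) {
        return received.length == expectedLength
            && checksumOf(received) == expectedChecksum;
    }

    public static void main(String[] args) {
        byte[] written = new byte[48];
        Arrays.fill(written, (byte) 7);
        long expected = checksumOf(written);

        // Truncated read, like the "46 bytes; 48 bytes expected" case above.
        byte[] truncated = Arrays.copyOf(written, 46);

        // Same length, but one bit flipped in transit or on disk.
        byte[] flipped = Arrays.copyOf(written, written.length);
        flipped[10] ^= 0x01;

        System.out.println(isIntact(written, 48, expected));   // true
        System.out.println(isIntact(truncated, 48, expected)); // false
        System.out.println(isIntact(flipped, 48, expected));   // false
    }
}
```

   With a check like this, a corrupt block can be rejected and refetched instead of surfacing later as an `EOFException` inside the serializer.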
   
   So I think that, apart from code bugs, it is still possible to read corrupt 
metadata from disk or the network, even though this is less likely than for 
shuffle data because metadata is usually smaller. When it happens, falling back 
to fetching the original blocks should be safe.
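
   The fallback behaviour can be sketched roughly as follows. This is an illustration under assumed names, not the code in this patch: `MergedMetaReader`, `blocksToFetch`, and the string block IDs are all hypothetical.

```java
import java.io.IOException;
import java.util.List;

// Hypothetical sketch, not the patch itself: if the metadata of a merged
// block cannot be read or yields no chunks, fetch the original blocks.
public class MergedBlockFallbackSketch {

    // Stand-in for whatever reads the merged block's metadata from disk or
    // the network; a read error or corrupt file surfaces as IOException.
    public interface MergedMetaReader {
        List<String> readChunks(String mergedBlockId) throws IOException;
    }

    // Decide which blocks to fetch for one merged block.
    public static List<String> blocksToFetch(
            String mergedBlockId,
            MergedMetaReader reader,
            List<String> originalBlockIds) {
        try {
            List<String> chunks = reader.readChunks(mergedBlockId);
            // An empty result should not happen by design, so treat it as
            // unusable metadata rather than as "nothing to fetch".
            return chunks.isEmpty() ? originalBlockIds : chunks;
        } catch (IOException e) {
            // Unreadable or corrupt metadata: the original blocks still
            // hold the same data, so fetching them is a safe fallback.
            return originalBlockIds;
        }
    }

    public static void main(String[] args) {
        List<String> originals = List.of("orig-0", "orig-1");
        System.out.println(blocksToFetch("merged-0", id -> List.of("chunk-0"), originals));
        System.out.println(blocksToFetch("merged-0", id -> List.of(), originals));
        System.out.println(blocksToFetch("merged-0",
            id -> { throw new IOException("corrupt meta"); }, originals));
    }
}
```

   The fallback is safe in this sketch because the original blocks are an independent copy of the same data, so the only cost is losing the benefit of the merge for that block.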
   
   With this patch and #34934, data corruption is the only critical issue 
(meaning one that can fail the job) in our dozen rounds of 1T TPC-DS tests, and 
I think adding a checksum should solve it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


