pan3793 edited a comment on pull request #32385:
URL: https://github.com/apache/spark/pull/32385#issuecomment-1053571929


   Hi @Ngone51, thanks for providing the checksum feature for shuffle. I have a thought about the following case.
   
   > if the corruption happens inside BufferReleasingInputStream, the reducer 
will throw the fetch failure immediately no matter what the cause is since the 
data has been partially consumed by downstream RDDs.
   
   I think the checksum could handle this case as well. Picking up the idea from #28525, we could consume the input stream inside `ShuffleBlockFetcherIterator` and verify the checksum immediately, then
   
   > If it's a disk issue or unknown, it will throw fetch failure directly. If 
it's a network issue, it will re-fetch the block later. 
   
   To achieve this, we may need to update the shuffle network protocol to 
support passing checksums when fetching shuffle blocks.
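   
   For illustration, here is a minimal sketch (plain JDK streams, not Spark's actual internals) of what "consume once, verify, then replay" could look like. The `consumeAndVerify` helper, the CRC32 choice, and the `expectedChecksum` parameter shipped by the server are all assumptions for the sake of the example:
   
   ```scala
   import java.io.{File, FileInputStream, FileOutputStream, InputStream}
   import java.util.zip.{CheckedInputStream, CRC32}
   
   // Hypothetical helper: fully consume the fetched block while computing a CRC32,
   // spilling the bytes to a local temp file so they can be replayed to the reducer.
   // `expectedChecksum` is assumed to arrive alongside the block via the extended
   // fetch protocol.
   def consumeAndVerify(block: InputStream, expectedChecksum: Long): InputStream = {
     val tmp = File.createTempFile("shuffle-block", ".data")
     tmp.deleteOnExit()
     val checked = new CheckedInputStream(block, new CRC32())
     val out = new FileOutputStream(tmp)
     val buf = new Array[Byte](8192)
     var n = checked.read(buf)
     while (n != -1) {
       out.write(buf, 0, n) // first pass: copy to disk while the checksum accumulates
       n = checked.read(buf)
     }
     out.close()
     if (checked.getChecksum.getValue != expectedChecksum) {
       // Mismatch is detected before any data reaches downstream RDDs, so the
       // caller can still choose between re-fetching (network corruption) and
       // throwing a fetch failure (disk or unknown cause).
       throw new java.io.IOException("checksum mismatch for fetched shuffle block")
     }
     new FileInputStream(tmp) // second pass happens when the reducer consumes this
   }
   ```
   
   The replay of the spilled file is exactly the "read the data twice" overhead mentioned below.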
   
   Compared to the existing `Utils.copyStreamUpTo(input, maxBytesInFlight / 3)`, this approach uses less memory and can verify blocks of any size, but it introduces extra overhead because it needs to read the data twice.
   
   We encounter this issue in our production environment every day, and for some jobs the performance overhead is acceptable compared to the stability it buys.
   
   @Ngone51 WDYT?
   

