gaoyajun02 opened a new pull request, #46934:
URL: https://github.com/apache/spark/pull/46934

   ### What changes were proposed in this pull request?
   Add a consistency check on mapIds between the push-merged block meta from the 
server side and the partition-level bitmap on the driver side for reduce tasks. 
If any mapIds are found to be missing, fall back to fetching the original shuffle blocks.
   This end-to-end check helps avoid data loss during the shuffle 
read phase when reduce tasks fetch merged data.
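
The check described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the class `MergedBlockCheck` and its methods are hypothetical names, `java.util.BitSet` stands in for the bitmap type Spark uses, and it assumes the merged block meta exposes one mapId bitmap per chunk.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

// Hypothetical helper, not part of Spark: illustrates the mapId consistency check.
final class MergedBlockCheck {

    // Returns the mapIds the driver-side bitmap expects but that no chunk
    // bitmap from the server-side meta covers.
    static BitSet missingMapIds(BitSet driverMapIds, List<BitSet> chunkMapIds) {
        BitSet covered = new BitSet();
        for (BitSet chunk : chunkMapIds) {
            covered.or(chunk);            // union of all chunk-level bitmaps
        }
        BitSet missing = (BitSet) driverMapIds.clone();
        missing.andNot(covered);          // expected minus actually present
        return missing;
    }

    // True when the reducer should fetch the original (unmerged) shuffle blocks.
    static boolean shouldFallBack(BitSet driverMapIds, List<BitSet> chunkMapIds) {
        return !missingMapIds(driverMapIds, chunkMapIds).isEmpty();
    }

    // Small convenience for building a bitmap from a list of mapIds.
    static BitSet bits(int... mapIds) {
        BitSet set = new BitSet();
        for (int id : mapIds) {
            set.set(id);
        }
        return set;
    }

    public static void main(String[] args) {
        BitSet driver = bits(0, 1, 2, 3);
        List<BitSet> consistent = Arrays.asList(bits(0, 1), bits(2, 3));
        List<BitSet> damaged = Arrays.asList(bits(0, 1), bits(3));

        System.out.println(shouldFallBack(driver, consistent)); // false
        System.out.println(missingMapIds(driver, damaged));     // {2}
    }
}
```

In this sketch, a non-empty result from `missingMapIds` is the signal to discard the merged block and fetch the original shuffle blocks instead.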
   
   ### Why are the changes needed?
   
   ShuffleBlockFetcherIterator initializes its requests based on the merge status 
and map status from the driver side. The merge status's partition-level 
bitmap (mapIds) comes from the mapTracker that the shuffle service maintains in 
memory.
   However, the mapIds actually used when fetching chunk data come from the shuffle 
service's metaFile, and there is no consistency check between the two. 
   When the server encounters problems such as disk failures, the mapIds in the 
mapTracker and the metaFile can become inconsistent. This 
ultimately results in data loss when reduce tasks fetch merged data.
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   ### How was this patch tested?
   Unit tests (UT)
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

