gaoyajun02 commented on PR #46934:
URL: https://github.com/apache/spark/pull/46934#issuecomment-2176170520

   The example scenario above was identified through metric collection and
troubleshooting in our production environment. However, there are still many
mapId inconsistencies at the application layer that cannot be explained, and
this PR cannot guarantee that the shuffle data ends up consistent. The affected
service nodes (a very small share of the cluster, about 0.1%) exhibit common
system-level file system errors, and with a small probability data loss occurs
on them daily. Given these issues, the approach I am currently considering is
to roll back the data of the entire reduce partition, not just the inconsistent
mapIds.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

