gaoyajun02 commented on PR #46934: URL: https://github.com/apache/spark/pull/46934#issuecomment-2176170520
We identified the example scenario above by collecting metrics and pinpointing the fault in our production environment. However, the application layer still shows many mapId inconsistencies that we cannot explain, and this PR cannot guarantee that the shuffle data ends up consistent. The affected service nodes (a very small share of the cluster, about 0.1%) share common system-level file system errors, and with low probability they cause data loss on a daily basis. Given these issues, the solution I am currently considering is to roll back the data for the entire reduce partition, not just the inconsistent mapIds.
