duanmeng commented on pull request #29907:
URL: https://github.com/apache/spark/pull/29907#issuecomment-703071440
> It should be a bug. I thought you were able to reproduce it because you said "I added some logs and reproduce the issue." I may misunderstand it.

We have two identical Spark apps running in two clusters and compare their results (it's a critical task, so we have to double-check its output). Diffs caused by data loss show up every 3-4 weeks. I added logs in the bypass shuffle writer and caught this issue when it reproduced. However, its root cause should be the disk or kernel in our cluster (disk pressure is very high), which fails only occasionally. `spark.shuffle.sync` could be used to force sync writes, but it has a performance impact, so I suggest adding length checking instead.
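To make the suggestion concrete, here is a minimal sketch of the kind of post-write length check being proposed. This is not the actual patch in this PR; the object and method names (`ShuffleLengthCheck`, `verifyShuffleDataFile`) are hypothetical, and it only assumes what the comment states: the writer records per-partition lengths, and those should add up to the bytes actually on disk.

```scala
import java.io.{File, IOException}

object ShuffleLengthCheck {
  /**
   * Hypothetical sanity check run after the bypass shuffle writer has
   * concatenated the per-partition files into the final data file.
   * If the file on disk is shorter than the sum of the recorded partition
   * lengths, the write was silently truncated (e.g. by a disk or kernel
   * fault), so fail the task here and let Spark retry it, instead of
   * letting a reader see corrupt shuffle data later.
   */
  def verifyShuffleDataFile(dataFile: File, partitionLengths: Array[Long]): Unit = {
    val expectedLength = partitionLengths.sum
    val actualLength = dataFile.length() // 0 if the file does not exist
    if (actualLength != expectedLength) {
      throw new IOException(
        s"Shuffle data file ${dataFile.getAbsolutePath} is corrupt: expected " +
        s"$expectedLength bytes but found $actualLength bytes on disk.")
    }
  }
}
```

Compared with setting `spark.shuffle.sync=true` (which forces outstanding writes to disk on every commit and slows down all shuffle writes), a check like this costs only one metadata lookup per map task and turns silent data loss into a retriable task failure.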
