duanmeng commented on pull request #29907:
URL: https://github.com/apache/spark/pull/29907#issuecomment-703071440


   > It should be a bug. I thought you were able to reproduce it because you said "I added some logs and reproduce the issue." I may misunderstand it.
   
   We run two identical Spark apps in two clusters and compare their results (it's a critical task, so we have to double-check the output). Every 3-4 weeks there might be diffs caused by lost data. I added logs to the bypass shuffle writer and caught this issue when it reproduced.
   
   However, the root cause is likely the disk or kernel in our cluster (disk pressure is very high), which only manifests occasionally. `spark.shuffle.sync` could be used to force synced writes, but it has a performance impact, so I suggest adding a length check instead.
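   
   To make the suggestion concrete, here is a minimal sketch of the kind of post-write length check I have in mind. `spark.shuffle.sync` is the existing config mentioned above; the object, helper name, and call site are illustrative, not Spark's actual writer API:
   
   ```scala
   import java.io.{File, IOException}
   import org.apache.spark.SparkConf
   
   object ShuffleLengthCheckSketch {
     // `spark.shuffle.sync` is an existing Spark config that forces the disk
     // block writer to sync on close, closing the lost-write window at a
     // throughput cost.
     val syncedConf = new SparkConf().set("spark.shuffle.sync", "true")
   
     // Hypothetical cheaper check: after the writer closes a partition file,
     // compare the persisted length with the byte count the writer reported.
     // A mismatch would surface the silent data loss described above instead
     // of letting the task succeed with a truncated shuffle file.
     def verifyPartitionLength(file: File, expectedBytes: Long): Unit = {
       val actualBytes = file.length()
       if (actualBytes != expectedBytes) {
         throw new IOException(
           s"Shuffle file ${file.getPath} has $actualBytes bytes on disk " +
           s"but the writer reported $expectedBytes; possible lost write.")
       }
     }
   }
   ```
   
   Failing fast here would turn an occasional wrong result into a retryable task failure, which seems like the right trade-off for a check this cheap.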

