Github user mridulm commented on the issue:
https://github.com/apache/spark/pull/21698
@jiangxb1987 The data loss arises because a re-execution of zip might generate
a key whose corresponding reducer has already finished.
Re-executing the stage therefore does not cause that reducer partition in the
subsequent child stage to be re-executed, which results in data loss.
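
To make the failure mode concrete, here is a minimal sketch in plain Python (not the Spark API). It assumes two reducers, integer keys bucketed by `key % num_reducers`, and an upstream that delivers keys in a different order on retry; `zip_map_task` and `partition` are hypothetical helpers written only for this illustration.

```python
def zip_map_task(keys, values):
    """Pair elements by position, the way a zip over two inputs does."""
    return list(zip(keys, values))


def partition(records, num_reducers):
    """Bucket (key, value) records by key, standing in for a hash partitioner."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in records:
        buckets[key % num_reducers].append((key, value))
    return buckets


values = ["a", "b", "c", "d"]

# Attempt 1: the upstream happens to deliver the keys in this order.
attempt1 = partition(zip_map_task([0, 1, 2, 3], values), 2)

# Reducer 0 fetches its bucket from attempt 1 and finishes.
reducer0_output = attempt1[0]

# The map output is lost before reducer 1 fetches it. The map task is
# re-executed, and this time the upstream delivers the keys in another order
# (assumed here to model a non-deterministic input).
attempt2 = partition(zip_map_task([1, 0, 3, 2], values), 2)

# Only reducer 1 runs against attempt 2; reducer 0 is not re-executed.
reducer1_output = attempt2[1]

seen = sorted(v for _, v in reducer0_output + reducer1_output)
lost = sorted(set(values) - set(seen))

print("seen:", seen)
print("lost:", lost)
```

In this sketch the values "b" and "d" land in reducer 0's bucket only in the second attempt, after reducer 0 has already finished, so they never reach any reducer, while "a" and "c" are consumed twice.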