Github user mridulm commented on the issue:
https://github.com/apache/spark/pull/21698
@cloud-fan We should not look at a particular stage in isolation, but
rather at what happens when there are failures in the middle of a job with
multiple shuffle stages, where zip is one of the internal stages.
A synthetic example:
```scala
rdd1.zip(rdd2)
  .map(v => (computeKey(v._1, v._2), computeValue(v._1, v._2)))
  .groupByKey()
  .map(...)
  .save()
```
If the relative ordering of elements in rdd1 or rdd2 changes on
recomputation, the computed keys change as well - and we end up with data
loss if some of the tasks in the save stage have already completed.
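To make the failure mode concrete, here is a minimal simulation (plain Python, not Spark code; `compute_key`, the element values, and the reordering are all hypothetical) showing how zip's positional pairing produces different keys when one side is recomputed in a different order:

```python
# Hypothetical simulation: zip pairs elements by position, so if a
# recomputed partition yields the same elements in a different order,
# the derived keys change between the first run and the retry.

def compute_key(a, b):
    # Stand-in for the user's computeKey: depends on both zipped elements.
    return (a, b)

rdd1 = [10, 20, 30]
rdd2_first_run = ["a", "b", "c"]   # order observed on the first attempt
rdd2_recomputed = ["b", "a", "c"]  # same elements, different order on retry

keys_first = [compute_key(x, y) for x, y in zip(rdd1, rdd2_first_run)]
keys_retry = [compute_key(x, y) for x, y in zip(rdd1, rdd2_recomputed)]

# keys_first -> [(10, 'a'), (20, 'b'), (30, 'c')]
# keys_retry -> [(10, 'b'), (20, 'a'), (30, 'c')]
# A save task that already wrote output for key (10, 'a') never sees that
# key again on retry, while (10, 'b') is produced twice - records are lost
# or duplicated depending on which output tasks had already completed.
```

The point is that both runs see the same multiset of elements, yet the key sets differ, so partially completed downstream output cannot be reconciled with the retried computation.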