Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/21698
OK, we can treat it as data loss. However, it's not caused by Spark but by
the user. If a user calls `zip`, then uses a custom function to compute keys
from the zipped pairs, and finally calls `groupByKey`, there is nothing Spark
can guarantee if the RDDs are unsorted. I think in this case the user should
fix their business logic; Spark is doing nothing wrong here. Even if no task
ever fails, the user can still get a different result/cardinality when running
the query multiple times.
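To make the scenario concrete, here is a minimal sketch (hypothetical names, local-mode setup) of the pattern described above. The pairing produced by `zip` depends on the element order within each partition, so if an upstream stage produces data in a nondeterministic order, a retried task can recompute different pairs and therefore different keys:

```scala
import org.apache.spark.sql.SparkSession

object ZipGroupByKeySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("zip-groupByKey-sketch")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Two RDDs with the same number of partitions and elements per partition.
    // In a real job the element order within partitions may come from an
    // upstream shuffle and is not guaranteed to be the same across retries.
    val left  = sc.parallelize(Seq("a", "b", "c", "d"), 2)
    val right = sc.parallelize(Seq(1, 2, 3, 4), 2)

    // `zip` pairs elements positionally, so the pairing itself changes
    // if either side's ordering changes on recomputation.
    val zipped = left.zip(right)

    // A user-defined key computed from the zipped pair.
    val keyed = zipped.map { case (s, i) => (s + (i % 2), i) }

    // If a retry recomputes `zipped` with a different pairing, the keys,
    // and therefore the groups, can differ from the first run.
    keyed.groupByKey().collect().foreach(println)

    spark.stop()
  }
}
```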
`repartition` is different because there is nothing wrong with the user's
business logic: they just want to repartition the data, and Spark should not
add/remove/update the existing records.
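For contrast, a small sketch of the `repartition` expectation (hypothetical data, just illustrating the contract described above): only the partitioning changes, so the set of records before and after should be identical on every run, with or without task retries.

```scala
import org.apache.spark.sql.SparkSession

object RepartitionContractSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("repartition-contract-sketch")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    val data = sc.parallelize(1 to 1000, 4)

    // The user only asks Spark to move records between partitions.
    val repartitioned = data.repartition(8)

    // Expected to hold on every run: same cardinality, same records.
    assert(repartitioned.count() == data.count())
    assert(repartitioned.subtract(data).isEmpty())

    spark.stop()
  }
}
```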
Anyway, if we do want to "fix" the `zip` problem, I think that should be a
different topic: we would need to write all the input data somewhere and
make sure a retried task gets exactly the same input, which is very expensive
and very different from this approach.