Github user jiangxb1987 commented on the issue:
https://github.com/apache/spark/pull/21698
> A synthetic example:
> rdd1.zip(rdd2).map(v => (computeKey(v._1, v._2), computeValue(v._1, v._2))).groupByKey().map().save()
The above example may produce different output when retrying a subset of the tasks, but I would not call that data loss or a data correctness issue. Imagine you run the query twice, each time with a different ordering of `rdd1` and `rdd2`: each run would produce different output (even a different number of output rows). The result produced by retrying a subset of the tasks is still valid; it simply corresponds to another representation of the input data, just not the same one as the initial input.
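
For concreteness, here is a runnable Scala sketch of the quoted pipeline. The `computeKey`/`computeValue` functions, the input data, and the output path are hypothetical stand-ins, and `saveAsTextFile` fills in for the pseudocode `save()`:

```scala
import org.apache.spark.sql.SparkSession

// A minimal sketch of the quoted pipeline; names and data are illustrative only.
object ZipGroupByKeySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("zip-groupByKey-sketch").getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical key/value functions over the zipped pair.
    val computeKey: (Int, Int) => Int = (a, b) => (a + b) % 10
    val computeValue: (Int, Int) => Int = (a, b) => a * b

    // If rdd1/rdd2 come out of a shuffle whose row ordering is not stable, zip()
    // can pair elements differently on a retried task than on the first attempt.
    val rdd1 = sc.parallelize(1 to 100, 4)
    val rdd2 = sc.parallelize(101 to 200, 4)

    rdd1.zip(rdd2)
      .map { case (a, b) => (computeKey(a, b), computeValue(a, b)) }
      .groupByKey()
      .map { case (k, vs) => s"$k\t${vs.mkString(",")}" }
      .saveAsTextFile("/tmp/zip-groupByKey-output") // hypothetical output path

    spark.stop()
  }
}
```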
Now I tend to believe there will be no data loss or data correctness issue as long as you don't spread the input data across partitions in a round-robin way (or, more generally, in a way that is not related to the data itself), because on task retry you are guaranteed that all the input data is covered (each row gets recomputed exactly once, though possibly in a different order).
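
To illustrate the distinction, here is a small sketch contrasting `repartition()` (a data-independent, round-robin style redistribution) with `partitionBy` using a `HashPartitioner` (routing that depends only on the row's key). The data and names are illustrative only:

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.sql.SparkSession

// A sketch contrasting data-independent and data-dependent redistribution.
object PartitioningStabilitySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("partitioning-stability").getOrCreate()
    val sc = spark.sparkContext

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3), ("a", 4)), 2)

    // repartition() spreads rows without looking at their content; if the upstream
    // iteration order changes on a task retry, a row can end up in a different
    // output partition than it did on the first attempt.
    val roundRobin = pairs.repartition(4)

    // partitionBy(HashPartitioner) routes each row by hashing its key, so the
    // target partition is a function of the data itself and stays the same on retry.
    val byKey = pairs.partitionBy(new HashPartitioner(4))

    // Per-partition row counts for each layout (sizes only, for illustration).
    println(roundRobin.glom().map(_.length).collect().mkString(","))
    println(byKey.glom().map(_.length).collect().mkString(","))

    spark.stop()
  }
}
```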