Github user jiangxb1987 commented on the issue:

    https://github.com/apache/spark/pull/21698

    > A synthetic example:
    > rdd1.zip(rdd2).map(v => (computeKey(v._1, v._2), computeValue(v._1, v._2))).groupByKey().map().save()

    The above example may produce different output when retrying a subset of the tasks, but I would not call that a data loss or data correctness issue. Imagine you run the query twice, each time with a different ordering of `rdd1` and `rdd2`: each run would produce different output (even a different number of output rows). The result produced by retrying a subset of the tasks is still valid; it simply corresponds to another valid representation of the input data, just not the same one as the initial run.

    Now I tend to believe there will be no data loss or data correctness issue as long as you don't spread the input data across partitions in a round-robin way (or, more generally, in a way that is not determined by the data itself), because on task retry you are guaranteed that all input data is covered (each row gets recomputed exactly once, though possibly in a different order).
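    For concreteness, below is a minimal, runnable sketch of the quoted pipeline (`computeKey` and `computeValue` are hypothetical stand-ins, and the trailing `.map().save()` is omitted). The point it illustrates: `zip` pairs elements purely by their position within a partition, so if the per-partition order of `rdd1` or `rdd2` can differ on recomputation, a retried task may pair different elements and feed different (key, value) rows into `groupByKey`.

    ```scala
    import org.apache.spark.sql.SparkSession

    object ZipRetryExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("ZipRetryExample")
          .master("local[2]")
          .getOrCreate()
        val sc = spark.sparkContext

        // zip() requires both RDDs to have the same number of partitions
        // and the same number of elements per partition. Here the inputs
        // are deterministic; the ordering concern arises when either side
        // comes from a shuffle whose per-partition order is unstable.
        val rdd1 = sc.parallelize(Seq(1, 2, 3, 4), numSlices = 2)
        val rdd2 = sc.parallelize(Seq("a", "b", "c", "d"), numSlices = 2)

        // Hypothetical stand-ins for the computeKey / computeValue
        // functions in the quoted example.
        def computeKey(i: Int, s: String): Int = i % 2
        def computeValue(i: Int, s: String): String = s"$i-$s"

        // Pair by position, derive (key, value), then group by key.
        val grouped = rdd1.zip(rdd2)
          .map { case (x, y) => (computeKey(x, y), computeValue(x, y)) }
          .groupByKey()

        grouped.collect().foreach(println)
        spark.stop()
      }
    }
    ```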