Github user jiangxb1987 commented on the issue:
https://github.com/apache/spark/pull/21698
> A synthetic example:
> rdd1.zip(rdd2).map(v => (computeKey(v._1, v._2), computeValue(v._1, v._2))).groupByKey().map().save()
The above example may produce different output when retrying a subset of the tasks, but I would not call that data loss or a data correctness issue. Imagine you run the query twice, each time with a different ordering of `rdd1` and `rdd2`: each run would produce different output (even a different number of output rows). The result produced by retrying a subset of the tasks is still valid; it simply corresponds to another representation of the input data, just not the same one as the initial input.
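
For concreteness, here is a runnable Scala sketch of the quoted pipeline. The `computeKey`/`computeValue` functions, the input data, and the output path are hypothetical stand-ins, and `saveAsTextFile` fills in for the pseudocode `save()`:

```scala
import org.apache.spark.sql.SparkSession

// A minimal sketch of the quoted pipeline; names and data are illustrative only.
object ZipGroupByKeySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("zip-groupByKey-sketch").getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical key/value functions over the zipped pair.
    val computeKey: (Int, Int) => Int = (a, b) => (a + b) % 10
    val computeValue: (Int, Int) => Int = (a, b) => a * b

    // If rdd1/rdd2 come out of a shuffle whose row ordering is not stable, zip()
    // can pair elements differently on a retried task than on the first attempt.
    val rdd1 = sc.parallelize(1 to 100, 4)
    val rdd2 = sc.parallelize(101 to 200, 4)

    rdd1.zip(rdd2)
      .map { case (a, b) => (computeKey(a, b), computeValue(a, b)) }
      .groupByKey()
      .map { case (k, vs) => s"$k\t${vs.mkString(",")}" }
      .saveAsTextFile("/tmp/zip-groupByKey-output") // hypothetical output path

    spark.stop()
  }
}
```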
Now I tend to believe there will be no data loss or data correctness issue as long as you don't spread the input data across partitions in a round-robin way (or, more generally, in a way that is not related to the data itself), because on task retry you are guaranteed that all the input data is covered (each row gets recomputed exactly once, though possibly in a different order).
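
To illustrate the distinction, here is a small sketch contrasting `repartition()` (a data-independent, round-robin style redistribution) with `partitionBy` using a `HashPartitioner` (routing that depends only on the row's key). The data and names are illustrative only:

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.sql.SparkSession

// A sketch contrasting data-independent and data-dependent redistribution.
object PartitioningStabilitySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("partitioning-stability").getOrCreate()
    val sc = spark.sparkContext

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3), ("a", 4)), 2)

    // repartition() spreads rows without looking at their content; if the upstream
    // iteration order changes on a task retry, a row can end up in a different
    // output partition than it did on the first attempt.
    val roundRobin = pairs.repartition(4)

    // partitionBy(HashPartitioner) routes each row by hashing its key, so the
    // target partition is a function of the data itself and stays the same on retry.
    val byKey = pairs.partitionBy(new HashPartitioner(4))

    // Per-partition row counts for each layout (sizes only, for illustration).
    println(roundRobin.glom().map(_.length).collect().mkString(","))
    println(byKey.glom().map(_.length).collect().mkString(","))

    spark.stop()
  }
}
```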