Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/21698
OK, we can treat it as data loss. However, it's not caused by Spark but by
the user. If a user calls `zip`, then uses a custom function to compute keys
from the zipped pairs, and finally calls `groupByKey`, there is nothing Spark
can guarantee if the RDDs are unsorted. I think in this case the user should
fix their business logic; Spark is doing nothing wrong here. Even if no task
ever fails, the user can still get a different result/cardinality when running
the query multiple times.
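To make the scenario concrete, here is a minimal sketch (hypothetical names, local-mode setup) of the pattern described above. The pairing produced by `zip` depends on the element order within each partition, so if an upstream stage produces data in a nondeterministic order, a retried task can recompute different pairs and therefore different keys:

```scala
import org.apache.spark.sql.SparkSession

object ZipGroupByKeySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("zip-groupByKey-sketch")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Two RDDs with the same number of partitions and elements per partition.
    // In a real job the element order within partitions may come from an
    // upstream shuffle and is not guaranteed to be the same across retries.
    val left  = sc.parallelize(Seq("a", "b", "c", "d"), 2)
    val right = sc.parallelize(Seq(1, 2, 3, 4), 2)

    // `zip` pairs elements positionally, so the pairing itself changes
    // if either side's ordering changes on recomputation.
    val zipped = left.zip(right)

    // A user-defined key computed from the zipped pair.
    val keyed = zipped.map { case (s, i) => (s + (i % 2), i) }

    // If a retry recomputes `zipped` with a different pairing, the keys,
    // and therefore the groups, can differ from the first run.
    keyed.groupByKey().collect().foreach(println)

    spark.stop()
  }
}
```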
`repartition` is different because there is nothing wrong with the user's
business logic: they just want to repartition the data, and Spark should not
add/remove/update the existing records.
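For contrast, a small sketch of the `repartition` expectation (hypothetical data, just illustrating the contract described above): only the partitioning changes, so the set of records before and after should be identical on every run, with or without task retries.

```scala
import org.apache.spark.sql.SparkSession

object RepartitionContractSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("repartition-contract-sketch")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    val data = sc.parallelize(1 to 1000, 4)

    // The user only asks Spark to move records between partitions.
    val repartitioned = data.repartition(8)

    // Expected to hold on every run: same cardinality, same records.
    assert(repartitioned.count() == data.count())
    assert(repartitioned.subtract(data).isEmpty())

    spark.stop()
  }
}
```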
Anyway, if we do want to "fix" the `zip` problem, I think that should be a
different topic: we would need to write all the input data somewhere and
make sure a retried task gets exactly the same input, which is very expensive
and very different from this approach.