[GitHub] spark issue #21698: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

tgravescs Wed, 15 Aug 2018 10:57:01 -0700

Github user tgravescs commented on the issue:

    https://github.com/apache/spark/pull/21698
  
    So I think the assumption is that task results are idempotent but not 
ordered.  Sorry if that's contradictory.  The data itself has to be the same on 
rerun, but the order of the records in it doesn't.  That was my general 
assumption.  I think zip doesn't follow that, though, when the inputs aren't 
ordered.  Not sure if there are other operations Spark supports with the same 
problem; we need to go through the list I guess, unless someone already has?
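
    A minimal sketch of why zip breaks that assumption, in plain Python rather 
than Spark.  The helper `zip_partitions` and the two "attempts" are hypothetical 
stand-ins for a positional zip and for a task that is rerun; they are not Spark 
APIs.

```python
def zip_partitions(left, right):
    """Pair elements by position, the way a positional zip does."""
    return list(zip(left, right))

left = ["a", "b", "c", "d"]

# Two attempts of the same (hypothetical) upstream task:
# identical records, but emitted in a different order on rerun.
first_attempt = [1, 2, 3, 4]
rerun_attempt = [3, 1, 4, 2]

# The "idempotent but not ordered" assumption holds for the data itself...
same_records = sorted(first_attempt) == sorted(rerun_attempt)

# ...but the zipped output depends on position, so it changes on rerun.
zipped_first = zip_partitions(left, first_attempt)
zipped_rerun = zip_partitions(left, rerun_attempt)
zip_is_stable = zipped_first == zipped_rerun

print(same_records, zip_is_stable)
```

    Here the rerun contains the same records, yet "a" is paired with a different 
value, which is the kind of inconsistency in question.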
    
    I think we just need to document these operations and say the results can 
be inconsistent if the input is not sorted, or perhaps give users an option to 
also sort.  Either that or we have to say we don't support unordered output at 
all in Spark.  Thoughts on just documenting zip or others with unordered input?
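
    To illustrate the "option to also sort" idea, here is a rough plain-Python 
sketch.  `zip_sorted` is a hypothetical helper, not an existing Spark option, and 
it assumes the records have a total order so that sorting is well defined.

```python
def zip_sorted(left, right):
    """Sort both inputs before pairing, so the pairing no longer
    depends on the order in which the records happened to arrive."""
    return list(zip(sorted(left), sorted(right)))

left = ["a", "b", "c", "d"]

# Same records, different arrival order (e.g. original run vs. rerun).
first_attempt = [1, 2, 3, 4]
rerun_attempt = [3, 1, 4, 2]

stable = zip_sorted(left, first_attempt) == zip_sorted(left, rerun_attempt)
print(stable)
```

    The trade-off is the cost of the extra sort, which is why it would probably 
have to be opt-in rather than the default.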
    
    I don't think MapReduce and Pig have this issue because they don't 
internally support an operation like zip; everything is on key/values, with 
joins and group-bys on the keys.  User code there could produce the same 
problem as well, but I would claim it's the user's fault there.
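
    For contrast, a small plain-Python sketch of a key-based join.  `join_by_key` 
is a hypothetical helper, not MapReduce or Pig code; it only shows that matching 
on keys gives the same set of results whatever order the records arrive in.

```python
def join_by_key(left_pairs, right_pairs):
    """Inner join on the key; returns a set because output order is irrelevant."""
    right_index = {}
    for key, value in right_pairs:
        right_index.setdefault(key, []).append(value)
    return {
        (key, left_value, right_value)
        for key, left_value in left_pairs
        for right_value in right_index.get(key, [])
    }

left_pairs = [("k1", "a"), ("k2", "b")]

# Same key/value records, different arrival order on rerun.
first_attempt = [("k1", 1), ("k2", 2)]
rerun_attempt = [("k2", 2), ("k1", 1)]

joined_first = join_by_key(left_pairs, first_attempt)
joined_rerun = join_by_key(left_pairs, rerun_attempt)
print(joined_first == joined_rerun)
```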



