Github user mridulm commented on the issue: https://github.com/apache/spark/pull/22112

@tgravescs Please see https://github.com/apache/spark/pull/22112#discussion_r210788359 for further elaboration. We actually cannot support random order (except for a small subset of cases, such as map-only jobs).

Ideally, I would like to see order-sensitive closures fixed - and fixing repartition + shuffle would address the general case for all order-sensitive closures. This PR does not fix the problem; rather, it fails and retries the job as a workaround - which, as you mention, can be terribly expensive for large jobs. Of course, data correctness trumps performance, so I am fine with this as a stop-gap. I would expect most non-trivial applications will simply work around this by checkpointing to HDFS, like what we did in YST.
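For illustration only (not code from this PR), here is a minimal Scala sketch of the checkpoint-to-HDFS workaround the comment refers to; the app name, checkpoint path, and sizes are hypothetical. Round-robin `repartition()` is order-sensitive, so if an upstream task is recomputed after a fetch failure and emits rows in a different order, downstream partitions can silently lose or duplicate rows; materializing the input first makes retries deterministic.

```scala
import org.apache.spark.sql.SparkSession

object RepartitionCheckpointSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("repartition-checkpoint-sketch") // hypothetical app name
      .master("local[4]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Stand-in for a real upstream computation; in practice the hazard
    // arises when upstream output order is nondeterministic on recompute.
    val data = sc.parallelize(1 to 1000000, numSlices = 8)

    // Workaround: materialize the input to stable storage before the
    // order-sensitive stage, so any retry replays identical input
    // instead of recomputing upstream lineage.
    sc.setCheckpointDir("hdfs:///tmp/rdd-checkpoints") // hypothetical path
    data.checkpoint()
    data.count() // run an action so the checkpoint is actually written

    // repartition() distributes records round-robin; its output now
    // depends only on the checkpointed data, not on recomputed lineage.
    val repartitioned = data.repartition(16)
    println(repartitioned.count())

    spark.stop()
  }
}
```

The trade-off is an extra pass over the data and HDFS storage for the checkpoint, which is typically far cheaper than failing and retrying a large job.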