Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/22112
@mridulm Thanks for pointing out that comment, I hadn't seen it; it's a very
nice write-up. I don't agree that "We actually cannot support random output".
Users can do this now in MR and Spark, and we can't really stop them other than
to say we don't support it, and that if you do it, failure handling will cause
different results.
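To make the order-sensitivity concrete, here is a minimal sketch in plain Python (not Spark; the `round_robin` helper is hypothetical) of why round-robin repartitioning depends on input order: the same set of records, delivered in a different order on a task retry, lands in different partitions.

```python
def round_robin(records, num_partitions):
    """Deal records into partitions in arrival order,
    mimicking how a round-robin repartition assigns rows."""
    partitions = [[] for _ in range(num_partitions)]
    for i, rec in enumerate(records):
        partitions[i % num_partitions].append(rec)
    return partitions

data = [10, 20, 30, 40, 50, 60]
first_run = round_robin(data, 2)

# Simulate a task retry where the upstream shuffle delivers the
# same records in a different order.
retry_run = round_robin([20, 10, 30, 40, 60, 50], 2)

print(first_run)  # [[10, 30, 50], [20, 40, 60]]
print(retry_run)  # [[20, 30, 60], [10, 40, 50]]
```

Same records overall, different partition contents, so any downstream operation that depends on partition membership or position (zipWithIndex, for example) can return different results after a retry.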
> Ideally, I would like to see order sensitive closure's fixed - and fixing
repartition + shuffle would fix this for a general case for all order sensitive
closures.
This is what I'm trying to get at. We need to decide whether we truly want
to fix these cases or just warn the user about them. Or perhaps it's a
combination, depending on the case. I agree that most likely whatever fix we
make for repartition could be applied to the others as well. I don't want us to
document it away now and then change our mind in the next release. Our end
decision should be final.
Note that so far I don't see how this temporary workaround can be extended
to ResultTasks. The only way I see to do this is to sort the output, which,
like you said, is very expensive in Spark, and you would have to force it
because you don't know in advance whether a task will fail. It also may only
apply to certain operations we control, like zip.
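Continuing the plain-Python sketch from above (again, not Spark; `deterministic_round_robin` is a hypothetical helper), this is the sort-based idea in miniature: sorting the records before dealing them out makes partition contents independent of arrival order, which is exactly why it survives retries, and exactly why it costs a full sort.

```python
def deterministic_round_robin(records, num_partitions):
    """Sort first, then deal into partitions, so the result no
    longer depends on the order records happened to arrive in."""
    partitions = [[] for _ in range(num_partitions)]
    for i, rec in enumerate(sorted(records)):
        partitions[i % num_partitions].append(rec)
    return partitions

run1 = deterministic_round_robin([10, 20, 30, 40, 50, 60], 2)
run2 = deterministic_round_robin([20, 10, 30, 40, 60, 50], 2)
print(run1 == run2)  # True: identical partitions either way
```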
> There should be a lot of usage within mllib (atleast yahoo's internal
BigML library did have a lot of it).
Do you know more specifics here? What are the ML libraries doing to cause
this: some sort of sampling, or do the algorithms just generate slightly
different results depending on shuffle order?