Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/21698
Thinking about the sorting idea again: I don't think it works unless
you sort both the keys and the values themselves.
For instance, let's say we have a groupBy key that generates:
B = k1, { (a11, a12, a13), (a21, a22, a23), (a11, a12, a13) }
Then if we do a distinct on it, you could get:
C = ((a11, a12, a13), (a21, a22, a23))
But say you get fetch failures and you refetch the data. Since the reducer
can fetch the blocks in any order during the shuffle, B could come back as
B = k1, { (a21, a22, a23), (a11, a12, a13), (a11, a12, a13) }
and after the distinct you end up with
C = ((a21, a22, a23), (a11, a12, a13))
If you were to just sort by key, the elements of C could still end up in a
different order, and if you then round-robin the output, the rows of C in the
example above would go to different partitions on the retry.
I'm not sure of a way to get around this without using the hash partitioner,
but I'm still looking into it. I also need to look at the details of how the
DataFrame PR solved this, as I'm curious how it handles the above
example.
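To make the failure mode concrete, here is a minimal sketch (plain Python, not Spark APIs; `distinct_preserving_order` and `round_robin` are hypothetical stand-ins for the distinct and round-robin repartition steps) showing that a distinct whose output order depends on fetch order sends the same rows to different partitions after a refetch:

```python
def distinct_preserving_order(rows):
    # distinct keeps the first occurrence, so output order depends on input order
    return list(dict.fromkeys(rows))

def round_robin(rows, num_partitions):
    # assign row i to partition i % num_partitions
    parts = [[] for _ in range(num_partitions)]
    for i, row in enumerate(rows):
        parts[i % num_partitions].append(row)
    return parts

# First fetch order for key k1
b1 = [("a11", "a12", "a13"), ("a21", "a22", "a23"), ("a11", "a12", "a13")]
# After a fetch failure, the same blocks can arrive in a different order
b2 = [("a21", "a22", "a23"), ("a11", "a12", "a13"), ("a11", "a12", "a13")]

c1 = distinct_preserving_order(b1)  # a11-row first
c2 = distinct_preserving_order(b2)  # a21-row first

# Same set of distinct rows, but round robin places them in different partitions
p1 = round_robin(c1, 2)
p2 = round_robin(c2, 2)
```

Here `set(c1) == set(c2)` but `p1 != p2`: the retry produces the same logical result in a different order, so the round-robin assignment is non-deterministic across attempts.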