Github user mridulm commented on the issue:
https://github.com/apache/spark/pull/20414
@jiangxb1987 Unfortunately I am unable to analyze this in detail, but
hopefully I can give some pointers that help!
One example I can think of is a shuffle that uses an Aggregator (like
combineByKey) via ExternalAppendOnlyMap.
The order in which we replay keys with the same hash is non-deterministic,
from what I remember. For example, if the first run did not result in any
spills, the second run had 3 spills, and the third run had 7, the order of
keys (with the same hash) could be different in each run.
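A toy Python sketch (not Spark's actual Scala code) of why this happens: if spill files are merged by key hash, ties between equal-hash keys are broken by spill order, so the replay order of colliding keys tracks how the records happened to be split into spills. The `replay` helper and the all-collide hash below are hypothetical, purely for illustration.

```python
import heapq

def replay(spills):
    """Merge spill iterators by key hash, hash-merge style.

    Equal hashes are tie-broken by spill position, so the replay
    order of colliding keys depends on the spill boundaries.
    """
    h = lambda k: 0  # toy hash: every key collides
    sorted_spills = [sorted(s, key=h) for s in spills]
    return list(heapq.merge(*sorted_spills, key=h))

records = ["a", "b", "c", "d"]

# Run 1: no spill -- everything replayed from one in-memory map.
one = replay([records])
# Run 2: the same records ended up split across two spill files.
two = replay([["a", "d"], ["b", "c"]])

print(one)  # ['a', 'b', 'c', 'd']
print(two)  # ['a', 'd', 'b', 'c']
```

Same multiset of keys in both runs, but a different replay order for the colliding keys.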
Similarly, with sort-based shuffle, depending on the length of the data
array in AppendOnlyMap (which is determined by whether we spilled or not),
we can get different sort orders?
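The capacity dependence can be illustrated with a toy open-addressed table (a hypothetical sketch, not AppendOnlyMap itself): the array slot each key lands in depends on `hash(key) % capacity` plus probing, so the same keys scanned out of tables of different sizes can come out in different orders. This relies on CPython hashing small ints to themselves.

```python
def scan_order(keys, capacity):
    """Place keys into an open-addressed array with linear probing,
    then return them in array-scan order."""
    table = [None] * capacity
    for k in keys:
        pos = hash(k) % capacity
        while table[pos] is not None:      # linear probe on collision
            pos = (pos + 1) % capacity
        table[pos] = k
    return [k for k in table if k is not None]

keys = [1, 9, 2]  # hash(small int) == the int itself in CPython

# Same keys, different array capacities (e.g. grown after a spill):
small = scan_order(keys, 8)   # 9 collides with 1 at slot 1
large = scan_order(keys, 16)  # no collisions

print(small)  # [1, 9, 2]
print(large)  # [1, 2, 9]
```

With capacity 8, key 9 probes into slot 2 and pushes key 2 to slot 3; with capacity 16 every key gets its own slot, so the scan order differs.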
Similarly, for the actual sort itself, the `merge` is quite clearly
sensitive to the number of spills (for example, when there is no aggregator
or ordering, it is simply `iterators.iterator.flatten`).
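A loose Python analogy for that flatten (again a sketch, not Spark's code): each spill chunk carries its own internal order, and flattening just concatenates the chunks, so the final sequence depends on where the spill boundaries fell. Here each chunk is sorted independently to stand in for a spill's internal order; the helper name is made up.

```python
from itertools import chain

def flatten_spills(spill_chunks):
    """Order each spill chunk internally, then concatenate them,
    loosely analogous to `iterators.iterator.flatten`."""
    return list(chain.from_iterable(sorted(c) for c in spill_chunks))

data = [3, 1, 2]

no_spill  = flatten_spills([data])          # one in-memory chunk
two_spill = flatten_spills([[3, 1], [2]])   # same data, two spill files

print(no_spill)   # [1, 2, 3]
print(two_spill)  # [1, 3, 2]
```

Concatenating independently ordered chunks is not the same as ordering the whole, so different spill counts yield different output orders.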
There might be other cases where this is happening - unfortunately I have
not looked at this part of the codebase regularly in a while now.
Please note that in all the cases above, there is no ordering defined.