Github user mridulm commented on the issue:
https://github.com/apache/spark/pull/20414
@shivaram Thinking more, everything which does a zip (or variants/similar
idioms like limit K, etc.) on a partition should be affected - with random +
index in coalesce + shuffle=true being just one special case.
Essentially, anything which assumes that the order of records in a partition
will always be the same is at risk. Currently, only the following should
guarantee a stable order - others need not:
* Reading from an external immutable source like hdfs, etc. (including
checkpoint)
* Reading from block store
* Sorted partitions
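To make the failure mode concrete, here is a small Python sketch (not Spark code - the map/block names are made up) of why a zipWithIndex-style idiom breaks when the shuffle fetch order differs between two attempts of the same task:

```python
# Records produced by three (hypothetical) map tasks for one reduce partition.
blocks = {"map0": ["a", "b"], "map1": ["c", "d"], "map2": ["e", "f"]}

def fetch(arrival_order):
    """One task attempt: records appear in whatever order blocks arrive.
    In a real shuffle, arrival order can change between attempts (e.g.
    after a fetch failure triggers a retry)."""
    return [rec for m in arrival_order for rec in blocks[m]]

def zip_with_index(records):
    """Stand-in for an index-assigning idiom like zipWithIndex."""
    return list(enumerate(records))

# First attempt vs. a retry where blocks arrived in a different order:
run1 = zip_with_index(fetch(["map0", "map1", "map2"]))
run2 = zip_with_index(fetch(["map2", "map0", "map1"]))

print(run1)  # record "a" gets index 0 here ...
print(run2)  # ... but index 2 here: the same record, different indices
```

The record sets are identical across the two runs, but the index assigned to any given record is not - which is exactly what corrupts random + index in coalesce with shuffle=true when a downstream task recomputes.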
The more I think about it, the more I like @sameeragarwal's suggestion in
#20393: a general solution for this could be to introduce deterministic
output for shuffle fetch - when enabled, it takes a more expensive but
repeatable iteration over the shuffle fetch.
This assumes that spark shuffle is always repeatable given the same input (I
have yet to look into this in detail when spills are involved - any thoughts
@sameeragarwal?). Today that could be an implementation detail, but we could
make it a requirement for shuffle.
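A minimal sketch of what such a repeatable fetch mode could look like (pure Python with invented names, not Spark's actual shuffle reader): pay an extra sort keyed on something stable, e.g. originating map id plus the record's offset within its block, so the output order no longer depends on block arrival order.

```python
# Hypothetical shuffle blocks for one reduce partition, keyed by map id.
blocks = {0: ["x", "y"], 1: ["p", "q"], 2: ["m", "n"]}

def fetch(arrival_order):
    """Plain fetch: output order follows block arrival order."""
    return [rec for m in arrival_order for rec in blocks[m]]

def deterministic_fetch(arrival_order):
    """Repeatable fetch: more expensive, but arrival-order independent.
    Tag each record with (map_id, offset_in_block) and sort on that key."""
    tagged = [(m, i, rec)
              for m in arrival_order
              for i, rec in enumerate(blocks[m])]
    tagged.sort(key=lambda t: (t[0], t[1]))  # stable key, not arrival order
    return [rec for _, _, rec in tagged]

# Plain fetch is sensitive to arrival order; deterministic fetch is not:
assert fetch([0, 1, 2]) != fetch([2, 0, 1])
assert deterministic_fetch([0, 1, 2]) == deterministic_fetch([2, 0, 1])
```

The cost is the extra tagging and sort per fetch, which is why it would make sense to enable this only for lineages that actually depend on partition-local ordering.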
Note that we might be able to avoid this additional cost for most of the
current use cases (otherwise we would have faced this problem 2 major
releases ago!); so the actual user impact, hopefully, might not be that high.