[GitHub] spark issue #20393: [SPARK-23207][SQL] Shuffle+Repartition on a DataFrame co...

sameeragarwal Mon, 29 Jan 2018 01:15:14 -0800

Github user sameeragarwal commented on the issue:

    https://github.com/apache/spark/pull/20393
  
    @mridulm one approach that Xingbo is looking into (independently of 
https://github.com/apache/spark/pull/20414) is to have the 
`ShuffleBlockFetcherIterator` remember the order of blocks it fetches and store 
them in that order. Given that the blocks will still be fetched in parallel, 
depending on the available buffer size, we'll then have to spill some 
out-of-order blocks to disk in order to avoid OOMs on the receiver (similar to 
https://github.com/apache/spark/pull/16989). While this would still regress 
performance, it might be better than the current local-sort-based fix. Note 
that I'm not arguing against the fact that hash partitioning would be the 
"best" fix in terms of performance, but it'd then defeat the purpose of 
repartition (due to skew).
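
To make the ordering idea concrete, here is a minimal sketch of a reorder buffer. It is not Spark code and not what the PR implements: `OrderedBlockBuffer`, `on_block_fetched`, `deliver_in_order`, and the byte budget are all hypothetical names chosen for illustration. Blocks may arrive in any order (since they are fetched in parallel), but they are handed to the consumer strictly in request order; an out-of-order block that would exceed the in-memory budget is written to a temp file instead.

```python
import os
import tempfile


class OrderedBlockBuffer:
    """Hypothetical reorder buffer: releases blocks in request order only."""

    def __init__(self, max_in_memory_bytes):
        self.max_in_memory_bytes = max_in_memory_bytes
        self.next_index = 0    # index of the block the consumer is waiting for
        self.in_memory = {}    # index -> bytes held in memory
        self.spilled = {}      # index -> path of the spilled temp file
        self.memory_used = 0
        self.spill_count = 0

    def on_block_fetched(self, index, data):
        """Record an arrival; return the (index, data) pairs now deliverable."""
        out_of_order = index != self.next_index
        over_budget = self.memory_used + len(data) > self.max_in_memory_bytes
        if out_of_order and over_budget:
            # Out-of-order block that does not fit in memory: spill it to disk.
            fd, path = tempfile.mkstemp(prefix="block-")
            with os.fdopen(fd, "wb") as f:
                f.write(data)
            self.spilled[index] = path
            self.spill_count += 1
        else:
            self.in_memory[index] = data
            self.memory_used += len(data)
        return self._drain()

    def _drain(self):
        """Release the longest run of consecutive blocks starting at next_index."""
        ready = []
        while True:
            i = self.next_index
            if i in self.in_memory:
                data = self.in_memory.pop(i)
                self.memory_used -= len(data)
            elif i in self.spilled:
                path = self.spilled.pop(i)
                with open(path, "rb") as f:
                    data = f.read()
                os.remove(path)
            else:
                break
            ready.append((i, data))
            self.next_index += 1
        return ready


def deliver_in_order(arrivals, max_in_memory_bytes):
    """Feed (index, data) arrivals through the buffer; return (delivered, spills)."""
    buf = OrderedBlockBuffer(max_in_memory_bytes)
    delivered = []
    for index, data in arrivals:
        delivered.extend(buf.on_block_fetched(index, data))
    return delivered, buf.spill_count
```

The trade-off mentioned above shows up directly: a smaller budget means more out-of-order blocks go to disk and are read back later, which costs I/O but keeps receiver memory bounded while still producing a deterministic block order.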


