[ 
https://issues.apache.org/jira/browse/SPARK-4479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14220229#comment-14220229
 ] 

Patrick Wendell commented on SPARK-4479:
----------------------------------------

I did some research into this. In the long term what we probably want is an 
ability for a shuffle dependency to include information about whether objects 
from the source RDD must be copied if they are buffered. This would require 
having some Copyable interface (or maybe doing something that is not type safe 
and assuming if someone selects this there is a copy() method).

In the short term the best way to do this is to have the exchange operator 
anticipate whether the rows will be buffered. The code where the sort based 
shuffle will bypass buffering is here:

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala#L134

The Exchange operator can be modified to only copy if the number of partitions 
is > 200 or there is an ordering (during a sort). We need to comment clearly in 
the Exchange operator and the External sorter that this logical dependency 
exists.

This at least means in the out-of-the-box case there will be no regression.

> Avoid unnecessary defensive copies when Sort based shuffle is on
> ----------------------------------------------------------------
>
>                 Key: SPARK-4479
>                 URL: https://issues.apache.org/jira/browse/SPARK-4479
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Michael Armbrust
>            Assignee: Cheng Lian
>            Priority: Blocker
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to