[
https://issues.apache.org/jira/browse/SPARK-4479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14220229#comment-14220229
]
Patrick Wendell commented on SPARK-4479:
----------------------------------------
I did some research into this. In the long term what we probably want is an
ability for a shuffle dependency to include information about whether objects
from the source RDD must be copied if they are buffered. This would require
having some Copyable interface (or maybe doing something that is not type safe
and assuming if someone selects this there is a copy() method).
In the short term the best way to do this is to have the exchange operator
anticipate whether the rows will be buffered. The code where the sort based
shuffle will bypass buffering is here:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala#L134
The Exchange operator can be modified to only copy if the number of partitions
is > 200 or there is an ordering (during a sort). We need to comment clearly in
the Exchange operator and the External sorter that this logical dependency
exists.
This at least means in the out-of-the-box case there will be no regression.
> Avoid unnecessary defensive copies when Sort based shuffle is on
> ----------------------------------------------------------------
>
> Key: SPARK-4479
> URL: https://issues.apache.org/jira/browse/SPARK-4479
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Reporter: Michael Armbrust
> Assignee: Cheng Lian
> Priority: Blocker
>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]