Josh Rosen created SPARK-7375:
---------------------------------
Summary: Avoid defensive copying in SQL exchange operator when
sort-based shuffle buffers data in serialized form
Key: SPARK-7375
URL: https://issues.apache.org/jira/browse/SPARK-7375
Project: Spark
Issue Type: Bug
Components: SQL
Reporter: Josh Rosen
The original sort-based shuffle buffers shuffle input records in memory while
sorting them. This causes problems when mutable records are presented to the
shuffle, as happens in Spark SQL's Exchange operator: the buffer holds
references to row objects that the operator reuses, so a buffered record can be
overwritten before it is written out. To work around this issue, SPARK-2967 and
SPARK-4479 added defensive copying of shuffle inputs in the Exchange operator
when sort-based shuffle is enabled.
I think that [~sandyr]'s recent patch for enabling serialization of records in
sort-based shuffle (SPARK-4550) and my proposed {{unsafe}}-based shuffle path
(SPARK-7081) may allow us to avoid this defensive copying in certain cases
(since our patches cause records to be serialized one-at-a-time and remove the
buffering of deserialized records).
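
To make the hazard and the proposed escape hatch concrete, here is a minimal sketch in plain Python. It is not Spark code: {{MutableRow}}, {{reused_rows}} and the three {{buffer_*}} helpers are hypothetical names invented for this illustration, and {{pickle}} merely stands in for whatever serializer the shuffle uses. It compares buffering references to a reused record, buffering defensive copies, and serializing each record as it arrives.

```python
import pickle


class MutableRow:
    """Hypothetical stand-in for a mutable row that a producer reuses."""

    def __init__(self):
        self.value = None


def reused_rows(values):
    """Yield the same MutableRow instance for every input value."""
    row = MutableRow()
    for v in values:
        row.value = v  # overwrite in place instead of allocating a new row
        yield row


def buffer_references(rows):
    """Buffer deserialized records by reference, then read them back."""
    buffered = [row for row in rows]  # every entry is the same object
    return [row.value for row in buffered]


def buffer_with_copy(rows):
    """Defensively copy each record before buffering it."""
    buffered = []
    for row in rows:
        copied = MutableRow()
        copied.value = row.value
        buffered.append(copied)
    return [row.value for row in buffered]


def buffer_serialized(rows):
    """Serialize each record on arrival and buffer only the bytes."""
    buffered = [pickle.dumps(row.value) for row in rows]
    return [pickle.loads(data) for data in buffered]


if __name__ == "__main__":
    print(buffer_references(reused_rows([1, 2, 3])))
    print(buffer_with_copy(reused_rows([1, 2, 3])))
    print(buffer_serialized(reused_rows([1, 2, 3])))
```

Only the by-reference buffer loses data; the serialized buffer preserves every record without a separate copy, because the bytes are captured before the producer overwrites the row.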
As mentioned in SPARK-4479, a long-term fix for this issue might be to add
hooks for informing the shuffle about object (im)mutability in order to allow
the shuffle layer to decide whether to copy. In the meantime, though, I think
that we should just extend the checks added in SPARK-4479 to avoid copies when
these new serialized sort paths are used.
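
As a rough sketch of the extended check, again in plain Python rather than Spark code: the function name and both boolean parameters below are hypothetical, and the real conditions under which a serialized sort path is selected are not spelled out in this issue, so they are collapsed into a single flag.

```python
def needs_defensive_copy(sort_based_shuffle, serialized_sort_path):
    """Decide whether the exchange should copy each row before shuffling.

    Both parameters are hypothetical flags for this illustration:
    sort_based_shuffle   -- True when sort-based shuffle is enabled
    serialized_sort_path -- True when records are serialized one at a time
                            instead of being buffered in deserialized form
    """
    if not sort_based_shuffle:
        # The existing copies were only added for sort-based shuffle.
        return False
    # Copy only when deserialized records would be buffered.
    return not serialized_sort_path
```

With this shape, the current behaviour corresponds to always passing {{serialized_sort_path=False}}; the proposal amounts to passing the real answer.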
/cc [~rxin] [~marmbrus] [~yhuai]