Josh Rosen created SPARK-7375:
---------------------------------

             Summary: Avoid defensive copying in SQL exchange operator when 
sort-based shuffle buffers data in serialized form
                 Key: SPARK-7375
                 URL: https://issues.apache.org/jira/browse/SPARK-7375
             Project: Spark
          Issue Type: Bug
          Components: SQL
            Reporter: Josh Rosen


The original sort-based shuffle buffers shuffle input records in memory while 
sorting them. This causes problems when mutable records are presented to the 
shuffle, which happens in Spark SQL's Exchange operator: if the same record 
object is reused and overwritten for successive rows, the buffered entries all 
point at one object holding only the latest row's data. To work around this 
issue, SPARK-2967 and SPARK-4479 added defensive copying of shuffle inputs in 
the Exchange operator when sort-based shuffle is enabled.
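
To make the aliasing problem concrete, here is a minimal, language-neutral 
sketch in Python. It is not Spark code: {{MutableRow}}, {{produce_rows}} and 
{{buffer_records}} are hypothetical names, and the list stands in for the 
shuffle's in-memory buffer.

```python
class MutableRow:
    """Hypothetical stand-in for a reusable, mutable row object."""

    def __init__(self, value=None):
        self.value = value

    def copy(self):
        return MutableRow(self.value)


def produce_rows(values):
    """Yield the SAME row object each time, overwriting it in place."""
    row = MutableRow()
    for v in values:
        row.value = v
        yield row


def buffer_records(records, defensive_copy):
    """Hold records in memory, as a buffering shuffle would before sorting."""
    return [r.copy() if defensive_copy else r for r in records]


# Without copying, every buffered entry is the same object.
aliased = [r.value for r in buffer_records(produce_rows([3, 1, 2]), False)]
# With defensive copying, each entry keeps its own snapshot.
copied = [r.value for r in buffer_records(produce_rows([3, 1, 2]), True)]

print(aliased)  # [2, 2, 2]
print(copied)   # [3, 1, 2]
```

The copy restores correctness but costs one extra allocation per record, 
which is the overhead this issue is about.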

I think that [~sandyr]'s recent patch for enabling serialization of records in 
sort-based shuffle (SPARK-4550) and my proposed {{unsafe}}-based shuffle path 
(SPARK-7081) may allow us to avoid this defensive copying in certain cases, 
since our patches cause records to be serialized one at a time and remove the 
buffering of deserialized records.
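
A rough sketch of why eager serialization sidesteps the problem, again in 
plain Python rather than Spark code. The assumption here is that each record 
is turned into bytes as soon as it arrives; {{pickle}} stands in for the 
shuffle serializer, and {{produce_rows}} and {{buffer_serialized}} are 
hypothetical names.

```python
import pickle


def produce_rows(values):
    """Yield one reused, mutable record, overwritten for each input value."""
    row = {"value": None}
    for v in values:
        row["value"] = v
        yield row


def buffer_serialized(records):
    """Serialize each record on arrival, so the buffer holds bytes, not objects."""
    return [pickle.dumps(r) for r in records]


buffered = buffer_serialized(produce_rows([3, 1, 2]))
restored = [pickle.loads(b)["value"] for b in buffered]

print(restored)  # [3, 1, 2]
```

Because the bytes are a snapshot taken before the producer mutates the record 
again, no separate defensive copy is needed on this path.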

As mentioned in SPARK-4479, a long-term fix for this issue might be to add 
hooks for informing the shuffle about object (im)mutability in order to allow 
the shuffle layer to decide whether to copy. In the meantime, though, I think 
that we should just extend the checks added in SPARK-4479 to avoid copies when 
these new serialized sort paths are used.
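
The shape of the extended check could look roughly like the sketch below. 
This is an illustration only, in Python rather than Spark's Scala: 
{{ShufflePath}}, {{SERIALIZED_PATHS}} and {{needs_defensive_copy}} are 
hypothetical names, and the real conditions under which each path is selected 
are not modeled.

```python
from enum import Enum


class ShufflePath(Enum):
    """Hypothetical labels for the shuffle write paths discussed above."""
    SORT_DESERIALIZED = "sort-based, buffers deserialized records"
    SORT_SERIALIZED = "sort-based, serializes records one at a time"
    UNSAFE = "unsafe-based, serializes records one at a time"


# Paths assumed never to hold references to deserialized input records.
SERIALIZED_PATHS = {ShufflePath.SORT_SERIALIZED, ShufflePath.UNSAFE}


def needs_defensive_copy(path):
    """Copy only when the chosen path buffers deserialized (mutable) records."""
    return path not in SERIALIZED_PATHS


for path in ShufflePath:
    print(path.name, needs_defensive_copy(path))
```

Keeping the decision in one predicate would also leave room for the 
longer-term (im)mutability hooks to replace it later.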

/cc [~rxin] [~marmbrus] [~yhuai]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

