Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5849#discussion_r29902740
  
    --- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlSerializer2.scala
 ---
    @@ -323,12 +316,12 @@ private[sql] object SparkSqlSerializer2 {
        */
       def createDeserializationFunction(
           schema: Array[DataType],
    -      in: DataInputStream,
    -      mutableRow: SpecificMutableRow): () => Unit = {
    +      in: DataInputStream): () => Row = {
         () => {
          // If the schema is null, the returned function does nothing when it gets called.
           if (schema != null) {
             var i = 0
    +        val mutableRow = new GenericMutableRow(schema.length)
    --- End diff --
    
    @yhuai and I chatted about this offline.  We need to perform this copy 
because this patch allows SqlSerializer2 to be used in cases where the shuffle 
performs a sort.  In HashShuffleReader, Spark ends up passing the iterator 
returned by this deserializer to ExternalSorter, which buffers rows because it 
needs to sort them based on their contents.
    
    I think that we only need to copy the row in cases where we're shuffling 
with a key ordering.  To avoid unnecessary copying in other cases, I think that 
we can extend `SparkSqlSerializer2`'s constructor to accept a boolean flag that 
indicates whether we should copy, and thread that flag all the way down to 
here.  In `Exchange`, where we create the serializer, we can check whether the 
shuffle will use a keyOrdering; if it does, then we'll enable copying.  
Avoiding this copy in other cases should provide a nice performance boost for 
aggregation queries.
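    To illustrate the hazard and the proposed fix, here is a minimal 
self-contained sketch outside of Spark. `SimpleRow`, the reused buffer, and 
the `needToCopy` flag are illustrative stand-ins, not Spark's actual API; the 
point is that a deserializer which reuses one mutable buffer per call corrupts 
any consumer that holds on to rows (like `ExternalSorter`), unless it makes a 
defensive copy:

    ```scala
    // A stand-in for Spark's Row: just an array wrapper with an explicit copy.
    final class SimpleRow(val values: Array[Int]) {
      def copy(): SimpleRow = new SimpleRow(values.clone())
    }

    object DeserializerSketch {
      // `needToCopy` is the boolean flag proposed above: it would be threaded
      // down from Exchange, which knows whether the shuffle uses a key ordering
      // (and hence whether a downstream sorter will buffer deserialized rows).
      def createDeserializationFunction(
          numFields: Int,
          input: Iterator[Int],
          needToCopy: Boolean): () => SimpleRow = {
        val mutableBuffer = new Array[Int](numFields)  // reused across calls
        () => {
          var i = 0
          while (i < numFields) {
            mutableBuffer(i) = input.next()
            i += 1
          }
          val row = new SimpleRow(mutableBuffer)
          // Only pay for the copy when rows may outlive the next call.
          if (needToCopy) row.copy() else row
        }
      }
    }
    ```

    Without the copy, two successive calls return rows that alias the same 
underlying array, so a buffering consumer would see every earlier row 
overwritten by the latest one; with `needToCopy = true` each returned row is 
independent, at the cost of one allocation per row.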


