LantaoJin opened a new pull request, #48407:
URL: https://github.com/apache/spark/pull/48407
### What changes were proposed in this pull request?
`CollectLimitExec` is used when a logical `Limit` and/or `Offset` operation
is the final operator. Compared to `GlobalLimitExec`, it avoids shuffling
data into a single output partition.
But when the dataset is collected as a Dataset of JSON strings,
`CollectLimitExec` and `TakeOrderedAndProjectExec` cannot be applied, because
the `SpecialLimits` strategy does not work as expected.
Here is an example:
```
scala> spark.sql("select * from right_t limit 4").explain
== Physical Plan ==
CollectLimit 4
+- Scan hive spark_catalog.default.right_t [id#23, name#24], HiveTableRelation [`spark_catalog`.`default`.`right_t`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [id#23, name#24], Partition Cols: []]

scala> spark.sql("select * from right_t limit 4").toJSON.explain
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true, false, true) AS value#34]
   +- MapPartitions org.apache.spark.sql.Dataset$$Lambda/0x00000070021d8c58@5b17838a, obj#33: java.lang.String
      +- DeserializeToObject createexternalrow(staticinvoke(class java.lang.Integer, ObjectType(class java.lang.Integer), valueOf, id#27, true, false, true), name#28.toString, StructField(id,IntegerType,true), StructField(name,StringType,true)), obj#32: org.apache.spark.sql.Row
         +- GlobalLimit 4, 0
            +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=42]
               +- LocalLimit 4
                  +- Scan hive spark_catalog.default.right_t [id#27, name#28], HiveTableRelation [`spark_catalog`.`default`.`right_t`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [id#27, name#28],
```
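For reference, a minimal spark-shell reproduction of the same difference, using an in-memory range instead of the Hive table above (the dataset and column names are illustrative, not from the PR):
```scala
// Hypothetical reproduction in spark-shell; the range below stands in for right_t.
val df = spark.range(0, 100).selectExpr("id", "cast(id as string) as name")

// Terminal limit: expected to plan as CollectLimit, with no shuffle.
df.limit(4).explain()

// The same limit followed by toJSON: prior to this patch, the plan is expected to
// contain GlobalLimit over Exchange SinglePartition, as in the output above.
df.limit(4).toJSON.explain()
```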
### Why are the changes needed?
Without this patch, a simple "select ... limit" or "select ... order by ...
limit" query has to introduce a shuffle when its content is returned as a JSON
dataset, because neither `CollectLimitExec` nor `TakeOrderedAndProjectExec` can
be applied. Since `Dataset.toJSON` is a fundamental API, this causes poor
performance in many scenarios.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
UT added.
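As a rough illustration (not the actual test added in this PR), a unit test for this behavior could assert that no single-partition shuffle appears in the planned `toJSON` query:
```scala
// Hypothetical test sketch, assuming a suite that mixes in SharedSparkSession
// (e.g. extends QueryTest); names and assertions are illustrative only.
test("limit followed by toJSON should not introduce a shuffle") {
  val df = spark.range(0, 100).selectExpr("id", "cast(id as string) as name")
  val planString = df.limit(4).toJSON.queryExecution.executedPlan.toString
  // Before this patch the plan contains "Exchange SinglePartition"; afterwards it should not.
  assert(!planString.contains("Exchange SinglePartition"))
}
```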
### Was this patch authored or co-authored using generative AI tooling?
No