LantaoJin opened a new pull request, #48407:
URL: https://github.com/apache/spark/pull/48407
### What changes were proposed in this pull request?
`CollectLimitExec` is used when a logical `Limit` and/or `Offset` operation
is the final operator. Compared to `GlobalLimitExec`, it avoids shuffling
data into a single output partition.
But when the dataset is collected as a Dataset of JSON strings,
`CollectLimitExec` and `TakeOrderedAndProjectExec` cannot be applied, because
the `SpecialLimits` strategy does not work as expected.
Here is an example:
```
scala> spark.sql("select * from right_t limit 4").explain
== Physical Plan ==
CollectLimit 4
+- Scan hive spark_catalog.default.right_t [id#23, name#24], HiveTableRelation [`spark_catalog`.`default`.`right_t`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [id#23, name#24], Partition Cols: []]

scala> spark.sql("select * from right_t limit 4").toJSON.explain
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true, false, true) AS value#34]
   +- MapPartitions org.apache.spark.sql.Dataset$$Lambda/0x00000070021d8c58@5b17838a, obj#33: java.lang.String
      +- DeserializeToObject createexternalrow(staticinvoke(class java.lang.Integer, ObjectType(class java.lang.Integer), valueOf, id#27, true, false, true), name#28.toString, StructField(id,IntegerType,true), StructField(name,StringType,true)), obj#32: org.apache.spark.sql.Row
         +- GlobalLimit 4, 0
            +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=42]
               +- LocalLimit 4
                  +- Scan hive spark_catalog.default.right_t [id#27, name#28], HiveTableRelation [`spark_catalog`.`default`.`right_t`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [id#27, name#28],
```
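For reference, a minimal spark-shell reproduction of the same difference, using an in-memory range instead of the Hive table above (the dataset and column names are illustrative, not from the PR):
```scala
// Hypothetical reproduction in spark-shell; the range below stands in for right_t.
val df = spark.range(0, 100).selectExpr("id", "cast(id as string) as name")

// Terminal limit: expected to plan as CollectLimit, with no shuffle.
df.limit(4).explain()

// The same limit followed by toJSON: prior to this patch, the plan is expected to
// contain GlobalLimit over Exchange SinglePartition, as in the output above.
df.limit(4).toJSON.explain()
```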
### Why are the changes needed?
Without this patch, a simple "select ... limit" or "select ... order by ...
limit" query has to introduce a shuffle when its content is returned as a JSON
dataset, because neither `CollectLimitExec` nor `TakeOrderedAndProjectExec` can
be applied. Since `Dataset.toJSON` is a fundamental API, this causes poor
performance in many scenarios.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
UT added.
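As a rough illustration (not the actual test added in this PR), a unit test for this behavior could assert that no single-partition shuffle appears in the planned `toJSON` query:
```scala
// Hypothetical test sketch, assuming a suite that mixes in SharedSparkSession
// (e.g. extends QueryTest); names and assertions are illustrative only.
test("limit followed by toJSON should not introduce a shuffle") {
  val df = spark.range(0, 100).selectExpr("id", "cast(id as string) as name")
  val planString = df.limit(4).toJSON.queryExecution.executedPlan.toString
  // Before this patch the plan contains "Exchange SinglePartition"; afterwards it should not.
  assert(!planString.contains("Exchange SinglePartition"))
}
```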
### Was this patch authored or co-authored using generative AI tooling?
No