YE created SPARK-48030:
--------------------------
Summary: InternalRowComparableWrapper should cache rowOrdering to
improve performance
Key: SPARK-48030
URL: https://issues.apache.org/jira/browse/SPARK-48030
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 3.4.3, 3.5.1
Reporter: YE
Attachments: screenshot-1.png
InternalRowComparableWrapper recreates the row ordering for each output
partition when SPJ (storage-partitioned join) is enabled. The row ordering is
generated via codegen, which is quite expensive, and the number of output
partitions can be very large for production tables, for example hundreds of
thousands of partitions. We encountered this issue when applying SPJ to
multiple large Iceberg tables: the planning phase took tens of minutes to
complete.
The attached screenshot shows the related stack trace:
!image-2024-04-28-20-27-54-039.png!
A simple fix would be to cache the rowOrdering in
InternalRowComparableWrapper, since the data types of the InternalRow are
immutable.