[
https://issues.apache.org/jira/browse/SPARK-48030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
YE updated SPARK-48030:
-----------------------
Description:
InternalRowComparableWrapper recreates the row ordering for each output
partition when SPJ (storage-partitioned join) is enabled. The row ordering is
generated via codegen, which is quite expensive, and the number of output
partitions can be very large for production tables, e.g. hundreds of thousands
of partitions. We encountered this issue when applying SPJ to multiple large
Iceberg tables: the planning phase took tens of minutes to complete.
A screenshot of the related stack trace is attached:
!screenshot-1.png!
A simple fix would be to cache the rowOrdering in
InternalRowComparableWrapper, since the data type of the InternalRow is
immutable.
was:
InternalRowComparableWrapper recreates the row ordering for each output
partition when SPJ (storage-partitioned join) is enabled. The row ordering is
generated via codegen, which is quite expensive, and the number of output
partitions can be very large for production tables, e.g. hundreds of thousands
of partitions. We encountered this issue when applying SPJ to multiple large
Iceberg tables: the planning phase took tens of minutes to complete.
A screenshot of the related stack trace is attached:
A simple fix would be to cache the rowOrdering in
InternalRowComparableWrapper, since the data type of the InternalRow is
immutable.
> InternalRowComparableWrapper should cache rowOrdering to improve performance
> ----------------------------------------------------------------------------
>
> Key: SPARK-48030
> URL: https://issues.apache.org/jira/browse/SPARK-48030
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.5.1, 3.4.3
> Reporter: YE
> Priority: Major
> Attachments: screenshot-1.png
>
>
> InternalRowComparableWrapper recreates the row ordering for each output
> partition when SPJ (storage-partitioned join) is enabled. The row ordering is
> generated via codegen, which is quite expensive, and the number of output
> partitions can be very large for production tables, e.g. hundreds of
> thousands of partitions. We encountered this issue when applying SPJ to
> multiple large Iceberg tables: the planning phase took tens of minutes to
> complete.
> A screenshot of the related stack trace is attached:
> !screenshot-1.png!
> A simple fix would be to cache the rowOrdering in
> InternalRowComparableWrapper, since the data type of the InternalRow is
> immutable.
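
To illustrate the proposed fix, here is a minimal sketch of the caching idea,
written in Python rather than as Spark code. It is not the actual Spark
implementation: build_ordering, cached_ordering and ComparableRow are
hypothetical names, and build_ordering only stands in for the expensive codegen
step. The point it shows is that the ordering depends only on the (immutable)
data types, so it can be built once per distinct type tuple and shared by every
wrapper instead of being rebuilt per partition.

```python
from functools import lru_cache, total_ordering


def build_ordering(data_types):
    """Hypothetical stand-in for the expensive codegen step."""
    # The "ordering" here is just a sort-key function for rows of these types.
    converters = [int if t == "int" else str for t in data_types]

    def key(row):
        return tuple(conv(v) for conv, v in zip(converters, row))

    return key


@lru_cache(maxsize=None)
def cached_ordering(data_types):
    # data_types must be hashable (a tuple) so it can serve as the cache key.
    return build_ordering(data_types)


@total_ordering
class ComparableRow:
    """Hypothetical wrapper that makes a partition-key row comparable."""

    def __init__(self, row, data_types):
        self.row = tuple(row)
        self.data_types = tuple(data_types)
        # Looked up from the cache instead of being rebuilt per instance.
        self._key = cached_ordering(self.data_types)

    def __eq__(self, other):
        return (self.data_types == other.data_types
                and self._key(self.row) == self._key(other.row))

    def __lt__(self, other):
        return self._key(self.row) < self._key(other.row)

    def __hash__(self):
        return hash((self.data_types, self._key(self.row)))


# Many "partitions" sharing one schema trigger a single build of the ordering.
types = ("int", "string")
partitions = [ComparableRow((i % 100, f"p{i}"), types) for i in range(10_000)]
partitions.sort()
print(cached_ordering.cache_info().misses)  # number of times the ordering was built
```

The cache in this sketch is unbounded, which is fine as long as the number of
distinct type tuples is small. A real fix would probably want to consider
bounding the cache and making it safe for concurrent access.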