[ 
https://issues.apache.org/jira/browse/SPARK-48030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YE updated SPARK-48030:
-----------------------
    Description: 
InternalRowComparableWrapper recreates the row ordering for each output 
partition when SPJ (storage-partitioned join) is enabled. The row ordering is 
generated via codegen, which is quite expensive, and the number of output 
partitions can be very large for production tables, for example hundreds of 
thousands of partitions. We encountered this issue when applying SPJ to 
multiple large Iceberg tables, where the planning phase took tens of minutes 
to complete.

The attached screenshot shows the related stack trace:
  !screenshot-1.png! 

A simple fix would be to cache the rowOrdering for 
InternalRowComparableWrapper, since the data types of the InternalRow are 
immutable.
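
The caching idea can be sketched roughly as follows. This is only an 
illustration under stated assumptions, not the actual Spark change: FieldType, 
OrderingCache, buildOrdering and orderingFor are hypothetical names, rows are 
modelled as plain Seq[Any] rather than InternalRow, and a hand-written 
comparator stands in for the codegen'd ordering. The point is only that the 
expensive ordering is built once per distinct list of data types and then 
reused for every partition value.

```scala
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.atomic.AtomicInteger
import java.util.function.{Function => JFunction}

// Hypothetical stand-ins for Spark's data types (not the real Spark API).
sealed trait FieldType
case object IntField extends FieldType
case object StringField extends FieldType

object OrderingCache {
  // Counts how often the "expensive" construction ran (illustration only).
  val builds = new AtomicInteger(0)

  // One cached ordering per distinct, immutable list of field types.
  private val cache =
    new ConcurrentHashMap[Seq[FieldType], Ordering[Seq[Any]]]()

  // Stand-in for the costly codegen step: compares rows field by field.
  private def buildOrdering(schema: Seq[FieldType]): Ordering[Seq[Any]] = {
    builds.incrementAndGet()
    new Ordering[Seq[Any]] {
      override def compare(a: Seq[Any], b: Seq[Any]): Int = {
        var i = 0
        while (i < schema.length) {
          val c = schema(i) match {
            case IntField =>
              Integer.compare(a(i).asInstanceOf[Int], b(i).asInstanceOf[Int])
            case StringField =>
              a(i).asInstanceOf[String].compareTo(b(i).asInstanceOf[String])
          }
          if (c != 0) return c
          i += 1
        }
        0
      }
    }
  }

  // Returns the cached ordering, building it only on the first request.
  def orderingFor(schema: Seq[FieldType]): Ordering[Seq[Any]] =
    cache.computeIfAbsent(
      schema,
      new JFunction[Seq[FieldType], Ordering[Seq[Any]]] {
        override def apply(s: Seq[FieldType]): Ordering[Seq[Any]] =
          buildOrdering(s)
      })
}
```

With such a cache, repeated calls like 
OrderingCache.orderingFor(Seq(IntField, StringField)) return the same ordering 
instance, so the construction cost is paid once per schema rather than once 
per output partition.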


> InternalRowComparableWrapper should cache rowOrdering to improve performance
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-48030
>                 URL: https://issues.apache.org/jira/browse/SPARK-48030
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.5.1, 3.4.3
>            Reporter: YE
>            Priority: Major
>         Attachments: screenshot-1.png
>
>
> InternalRowComparableWrapper recreates the row ordering for each output 
> partition when SPJ (storage-partitioned join) is enabled. The row ordering is 
> generated via codegen, which is quite expensive, and the number of output 
> partitions can be very large for production tables, for example hundreds of 
> thousands of partitions. We encountered this issue when applying SPJ to 
> multiple large Iceberg tables, where the planning phase took tens of minutes 
> to complete.
> The attached screenshot shows the related stack trace:
>   !screenshot-1.png! 
> A simple fix would be to cache the rowOrdering for 
> InternalRowComparableWrapper, since the data types of the InternalRow are 
> immutable.


