YE created SPARK-48030:
--------------------------

             Summary: InternalRowComparableWrapper should cache rowOrdering to 
improve performance
                 Key: SPARK-48030
                 URL: https://issues.apache.org/jira/browse/SPARK-48030
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.4.3, 3.5.1
            Reporter: YE
         Attachments: screenshot-1.png

When storage-partitioned join (SPJ) is enabled, InternalRowComparableWrapper 
recreates the row ordering for every output partition. The row ordering is 
generated via codegen, which is quite expensive, and a production table can 
have hundreds of thousands of output partitions. We encountered this issue 
when applying SPJ to multiple large Iceberg tables: the planning phase took 
tens of minutes to complete.

The attached screenshot shows the related stack trace:

!image-2024-04-28-20-27-54-039.png!

 

A simple fix would be to cache the rowOrdering in 
InternalRowComparableWrapper, since the data types of an InternalRow are 
immutable.
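
The caching idea can be illustrated with a small, self-contained sketch. It is 
written in Python rather than against Spark's Scala code, and every name in it 
(ordering_for, ComparableRow, build_count) is hypothetical: it only shows an 
ordering being built once per distinct tuple of data types instead of once per 
wrapper, and it is not the actual Spark implementation or the eventual patch.

```python
from functools import lru_cache

# Counts how often the (simulated) expensive ordering build actually runs.
build_count = 0


@lru_cache(maxsize=None)
def ordering_for(data_types):
    """Return a key function for rows with the given column types.

    Stands in for the codegen-based row ordering. `data_types` must be a
    hashable tuple, e.g. ("int", "string"), so it can serve as a cache key.
    """
    global build_count
    build_count += 1
    # A real implementation would compile a comparator for these types here.
    return lambda row: tuple(row)


class ComparableRow:
    """Hypothetical stand-in for InternalRowComparableWrapper."""

    def __init__(self, row, data_types):
        self.row = row
        self.data_types = tuple(data_types)
        # Look the ordering up in the cache instead of rebuilding it.
        self._key = ordering_for(self.data_types)

    def __eq__(self, other):
        return (
            isinstance(other, ComparableRow)
            and self.data_types == other.data_types
            and self._key(self.row) == other._key(other.row)
        )

    def __hash__(self):
        return hash((self.data_types, self._key(self.row)))


# 100,000 wrappers over rows that share one schema.
partitions = [
    ComparableRow((i % 1000, "p"), ("int", "string")) for i in range(100_000)
]
distinct = len(set(partitions))
```

Because the cache key is the tuple of data types, which does not change for a 
given set of partition rows, all wrappers with the same schema share one 
ordering, and the build cost no longer grows with the number of partitions.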



