[jira] [Updated] (SPARK-7637) StructType.merge slow with large denormalised tables O(N^2)

Rowan Chattaway (JIRA) Thu, 14 May 2015 08:16:49 -0700

     [ 
https://issues.apache.org/jira/browse/SPARK-7637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Rowan Chattaway updated SPARK-7637:
-----------------------------------
    Description: 
StructType.merge does a linear scan through the left schema and, for each 
element, scans the right schema. This results in an O(N^2) algorithm. 
I have found this to be very slow when dealing with large denormalised parquet 
files.
I would like to make a small change to this function to build maps of the 
fields of both the left and right schemas, so that each lookup takes constant 
time and the merge becomes O(N).
This gives a sizable increase in performance for large denormalised schemas.
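
As an illustration of the change described above, here is a minimal Python 
sketch comparing the two strategies on a simplified schema model. It is not 
Spark's implementation: Field, merge_quadratic, merge_mapped and _merge_field 
are hypothetical names, data types are plain strings, and fields are assumed 
to be matched by exact name.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Field:
    """Hypothetical stand-in for a schema field (not Spark's StructField)."""
    name: str
    dtype: str
    nullable: bool = True


def _merge_field(a: Field, b: Field) -> Field:
    # Assumption: fields with the same name must have the same type;
    # the merged field is nullable if either side is nullable.
    if a.dtype != b.dtype:
        raise ValueError(f"conflicting types for {a.name}: {a.dtype} vs {b.dtype}")
    return Field(a.name, a.dtype, a.nullable or b.nullable)


def merge_quadratic(left: List[Field], right: List[Field]) -> List[Field]:
    """For each left field, linearly scan the right schema: O(N^2)."""
    merged = []
    for lf in left:
        match = next((rf for rf in right if rf.name == lf.name), None)
        merged.append(lf if match is None else _merge_field(lf, match))
    left_names = [f.name for f in left]  # list membership is a linear scan
    merged.extend(rf for rf in right if rf.name not in left_names)
    return merged


def merge_mapped(left: List[Field], right: List[Field]) -> List[Field]:
    """Index both sides by name first, so every lookup is a hash lookup: O(N)."""
    right_by_name: Dict[str, Field] = {}
    for rf in right:
        right_by_name.setdefault(rf.name, rf)  # keep first match, like the scan
    left_names = {f.name for f in left}
    merged = []
    for lf in left:
        rf = right_by_name.get(lf.name)
        merged.append(lf if rf is None else _merge_field(lf, rf))
    merged.extend(rf for rf in right if rf.name not in left_names)
    return merged
```

Both functions return the left fields in order (merged where a match exists), 
followed by the fields that appear only on the right, so the mapped version 
can be checked against the quadratic one on the same inputs.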

Timings for a 10000x10000 column merge:
2891 ms original
32 ms with the mapped field approach

This merge can be called many times, depending upon the number of files whose 
schemas you need to merge, which compounds the performance cost.
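
To make the compounding effect concrete, the following Python sketch folds a 
pairwise merge over a list of per-file schemas. It is a simplified model under 
stated assumptions, not Spark code: schemas are plain dicts from column name 
to type string, and merge_schemas and merge_all are hypothetical helpers. 
Merging F file schemas this way takes F - 1 pairwise merges (F calls here, 
because the fold starts from an empty schema), so any per-merge cost is paid 
once per additional file.

```python
from functools import reduce
from typing import Dict, Iterable

Schema = Dict[str, str]  # column name -> type name (hypothetical model)


def merge_schemas(left: Schema, right: Schema) -> Schema:
    """Merge two schemas using hash lookups; dicts keep insertion order."""
    merged = dict(left)
    for name, dtype in right.items():
        if name in merged and merged[name] != dtype:
            raise ValueError(f"conflicting types for {name}")
        merged.setdefault(name, dtype)
    return merged


def merge_all(schemas: Iterable[Schema]) -> Schema:
    """Fold the pairwise merge over every file's schema, left to right."""
    return reduce(merge_schemas, schemas, {})
```

For example, merge_all([{"a": "int"}, {"b": "string", "a": "int"}]) yields a 
schema containing both "a" and "b", with "a" first.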

  was:
StructType.merge does a linear scan through the left schema and, for each 
element, scans the right schema. This results in an O(N^2) algorithm. 
I have found this to be very slow when dealing with large denormalised parquet 
files.
I would like to make a small change to this function to map the fields of both 
the left and right schemas resulting in O(N).
This has a sizable increase in performance for large denormalised schemas.

10000x10000 column merge 
2891ms Original  
32ms with mapped field approach.


> StructType.merge slow with large denormalised tables O(N^2)
> -----------------------------------------------------------
>
>                 Key: SPARK-7637
>                 URL: https://issues.apache.org/jira/browse/SPARK-7637
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.3.1
>            Reporter: Rowan Chattaway
>            Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> StructType.merge does a linear scan through the left schema and, for each 
> element, scans the right schema. This results in an O(N^2) algorithm. 
> I have found this to be very slow when dealing with large denormalised 
> parquet files.
> I would like to make a small change to this function to build maps of the 
> fields of both the left and right schemas, so that each lookup takes 
> constant time and the merge becomes O(N).
> This gives a sizable increase in performance for large denormalised schemas.
> Timings for a 10000x10000 column merge:
> 2891 ms original
> 32 ms with the mapped field approach
> This merge can be called many times, depending upon the number of files 
> whose schemas you need to merge, which compounds the performance cost.



