[
https://issues.apache.org/jira/browse/SPARK-7637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Rowan Chattaway updated SPARK-7637:
-----------------------------------
Description:
StructType.merge does a linear scan through the left schema and, for each
element, scans the right schema. This results in an O(N^2) algorithm.
I have found this to be very slow when dealing with large denormalised parquet
files.
I would like to make a small change to this function to map the fields of both
the left and right schemas, resulting in O(N).
This gives a sizable increase in performance for large denormalised schemas.
10000x10000 column merge:
  2891ms original
  32ms with mapped field approach
This merge can be called many times, depending upon the number of files whose
schemas you need to merge, which compounds the performance cost.
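
To illustrate the difference between the two approaches described above, here
is a minimal sketch. It is not Spark's StructType.merge code: the Field tuple
and the function names merge_quadratic and merge_mapped are hypothetical
stand-ins, field names are assumed to be unique within a schema, and type
compatibility is reduced to a simple equality check.

```python
from typing import List, Tuple

# Hypothetical stand-in for a schema field: (field name, type name).
Field = Tuple[str, str]


def merge_quadratic(left: List[Field], right: List[Field]) -> List[Field]:
    """Nested scans: for each left field, walk the whole right schema."""
    merged = []
    for name, dtype in left:
        for r_name, r_dtype in right:
            if r_name == name and r_dtype != dtype:
                raise ValueError(f"conflicting types for field {name!r}")
        merged.append((name, dtype))
    # Append right-only fields, again scanning the left schema each time.
    for r_name, r_dtype in right:
        if not any(name == r_name for name, _ in left):
            merged.append((r_name, r_dtype))
    return merged


def merge_mapped(left: List[Field], right: List[Field]) -> List[Field]:
    """Index both schemas by name once; each lookup is then O(1) on average."""
    left_by_name = dict(left)
    right_by_name = dict(right)
    merged = []
    for name, dtype in left:
        r_dtype = right_by_name.get(name)
        if r_dtype is not None and r_dtype != dtype:
            raise ValueError(f"conflicting types for field {name!r}")
        merged.append((name, dtype))
    # Append right-only fields, preserving their original order.
    for r_name, r_dtype in right:
        if r_name not in left_by_name:
            merged.append((r_name, r_dtype))
    return merged
```

Both functions return the same merged field list for the same inputs; only the
lookup cost differs, which is why the gap widens as the column count grows.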
> StructType.merge slow with large denormalised tables O(N^2)
> -----------------------------------------------------------
>
> Key: SPARK-7637
> URL: https://issues.apache.org/jira/browse/SPARK-7637
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 1.3.1
> Reporter: Rowan Chattaway
> Priority: Minor
> Original Estimate: 24h
> Remaining Estimate: 24h