Rowan Chattaway created SPARK-7637:
--------------------------------------
Summary: StructType.merge slow with large denormalised tables O(N^2)
Key: SPARK-7637
URL: https://issues.apache.org/jira/browse/SPARK-7637
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 1.3.1
Reporter: Rowan Chattaway
Priority: Minor
StructType.merge does a linear scan through the left schema and, for each
element, scans the right schema. This results in an O(N^2) algorithm.
I have found this to be very slow when dealing with large denormalised parquet
files.
I would like to make a small change to this function: index the fields of both
the left and right schemas in a map keyed by field name, so that each lookup is
constant time and the merge as a whole is O(N).
This gives a sizable performance improvement for large denormalised schemas.
10000x10000 column merge:
  Original:              2891 ms
  Mapped field approach:   32 ms
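
The difference between the two approaches can be sketched outside Spark. The following is a minimal Python model, not the actual StructType.merge code or the proposed patch: a schema is assumed to be a list of (name, type) pairs, and the names Field, merge_types, merge_quadratic and merge_mapped are hypothetical. The type-merging rule is also an assumption (identical types merge, anything else is an error).

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for a schema field: (column name, type name).
Field = Tuple[str, str]


def merge_types(left: str, right: str) -> str:
    # Assumed rule for this sketch: only identical types can be merged.
    if left != right:
        raise ValueError(f"cannot merge incompatible types {left} and {right}")
    return left


def merge_quadratic(left: List[Field], right: List[Field]) -> List[Field]:
    """Nested scans: every left field scans right, and vice versa. O(N^2)."""
    merged: List[Field] = []
    for name, dtype in left:
        # Linear search of the right schema for a field with the same name.
        match = next((rtype for rname, rtype in right if rname == name), None)
        merged.append((name, dtype if match is None else merge_types(dtype, match)))
    for name, dtype in right:
        # Linear search of the left schema to find fields only present on the right.
        if not any(lname == name for lname, _ in left):
            merged.append((name, dtype))
    return merged


def merge_mapped(left: List[Field], right: List[Field]) -> List[Field]:
    """Index both sides by name first, so each lookup is O(1). O(N) overall."""
    right_by_name: Dict[str, str] = {}
    for name, dtype in right:
        # Keep the first occurrence, matching the linear search above.
        right_by_name.setdefault(name, dtype)
    left_names = {name for name, _ in left}

    merged: List[Field] = []
    for name, dtype in left:
        if name in right_by_name:
            merged.append((name, merge_types(dtype, right_by_name[name])))
        else:
            merged.append((name, dtype))
    # Append right-only fields, preserving their original order.
    merged.extend((name, dtype) for name, dtype in right if name not in left_names)
    return merged


left = [("id", "long"), ("name", "string")]
right = [("name", "string"), ("score", "double")]
# Both approaches produce the same merged schema; only the cost differs.
assert merge_mapped(left, right) == merge_quadratic(left, right)
```

The mapped version trades a small amount of extra memory (one dictionary and one set) for the removal of the inner scans, which is where the quadratic cost comes from on wide schemas.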