Rowan Chattaway created SPARK-7637:
--------------------------------------

             Summary: StructType.merge slow with large denormalised tables, O(N^2)
                 Key: SPARK-7637
                 URL: https://issues.apache.org/jira/browse/SPARK-7637
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 1.3.1
            Reporter: Rowan Chattaway
            Priority: Minor


StructType.merge does a linear scan through the left schema and, for each 
element, scans the right schema. This results in an O(N^2) algorithm. 
I have found this to be very slow when dealing with large denormalised Parquet 
files.
I would like to make a small change to this function so that the fields of both 
the left and right schemas are indexed in a map, making the merge O(N).
This gives a sizable increase in performance for large denormalised schemas.
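
To illustrate the difference, here is a minimal sketch in Python rather than 
Spark's Scala. It is not the actual StructType.merge code or the proposed patch: 
the names merge_nested and merge_mapped and the (name, type) tuple representation 
are hypothetical, field names are assumed unique within a schema, and fields that 
share a name are simply required to have equal types, where the real function 
may merge them.

```python
from typing import List, Tuple

# Hypothetical stand-in for a schema field: (field name, data type name).
Field = Tuple[str, str]


def merge_nested(left: List[Field], right: List[Field]) -> List[Field]:
    """Quadratic merge: every lookup is a linear scan of the other schema."""
    merged = []
    for name, dtype in left:
        # Linear scan of the right schema for each left field.
        for rname, rtype in right:
            if rname == name:
                if rtype != dtype:
                    raise ValueError(f"conflicting types for field {name!r}")
                break
        merged.append((name, dtype))
    left_names = [n for n, _ in left]
    for name, dtype in right:
        # Membership test on a list is another linear scan.
        if name not in left_names:
            merged.append((name, dtype))
    return merged


def merge_mapped(left: List[Field], right: List[Field]) -> List[Field]:
    """Linear merge: index field names up front, then use hash lookups."""
    right_by_name = dict(right)               # name -> type, built in O(N)
    left_names = {name for name, _ in left}   # set for O(1) membership tests
    merged = []
    for name, dtype in left:
        rtype = right_by_name.get(name)
        if rtype is not None and rtype != dtype:
            raise ValueError(f"conflicting types for field {name!r}")
        merged.append((name, dtype))
    # Append right-only fields, preserving their original order.
    merged.extend((n, t) for n, t in right if n not in left_names)
    return merged
```

Both functions return the left fields followed by the right-only fields; only 
the cost of each lookup changes, from a scan to a hash lookup.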

10000x10000 column merge:
2891 ms original
32 ms with the mapped-field approach



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
