[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...

HyukjinKwon Thu, 05 Nov 2015 23:52:55 -0800

GitHub user HyukjinKwon opened a pull request:

    https://github.com/apache/spark/pull/9517


    [SPARK-11500][SQL] Not deterministic order of columns when using merging 
schemas.

    https://issues.apache.org/jira/browse/SPARK-11500
    
    As filed in SPARK-11500, when schema merging is enabled, the order in which 
files are touched can affect the order of the output columns. 
    
    This was mostly caused by the use of `Set` and `Map`, so I replaced them with 
`LinkedHashSet` and `LinkedHashMap` to keep the insertion order.
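
    A minimal sketch of the insertion-order property this relies on, using `scala.collection.mutable.LinkedHashSet` and `LinkedHashMap`. The object, method, and column names are hypothetical and are not taken from the patch.

```scala
import scala.collection.mutable

object InsertionOrderSketch {
  // Collect names in first-seen order; LinkedHashSet iterates in insertion order.
  def orderedNames(names: Seq[String]): List[String] = {
    val seen = mutable.LinkedHashSet.empty[String]
    names.foreach(seen += _)
    seen.toList
  }

  // Collect (name, type) pairs; LinkedHashMap iterates keys in insertion order.
  def orderedTypes(fields: Seq[(String, String)]): List[(String, String)] = {
    val byName = mutable.LinkedHashMap.empty[String, String]
    fields.foreach { case (name, tpe) => byName(name) = tpe }
    byName.toList
  }
}
```

    By contrast, a plain hash-based `Set` or `Map` makes no promise about iteration order, so column order derived from it can vary.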
    
    Also, the schemas are now reduced from the left, and the order of 
`filesToTouch` is changed from `metadataStatuses ++ commonMetadataStatuses ++ 
needMerged` to `needMerged ++ metadataStatuses ++ commonMetadataStatuses`, so 
that the part-files, which always have the schema in their footers, are touched 
first, whereas the others might not exist.
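
    A rough sketch of why the reduction direction and the file order matter. It assumes a schema can be modelled as an ordered list of field names (Spark itself uses `StructType`); `merge` and `mergeAll` are hypothetical helpers, not the code in this patch.

```scala
object MergeOrderSketch {
  // Assumption: a schema is just an ordered list of field names.
  type Schema = List[String]

  // Keep the left schema's order, then append fields that appear only on the right.
  def merge(left: Schema, right: Schema): Schema =
    left ++ right.filterNot(f => left.contains(f))

  // Reduce from the left so that the first schema decides the leading columns.
  // Note: reduceLeft throws on an empty input, so callers must pass at least one schema.
  def mergeAll(schemas: Seq[Schema]): Schema =
    schemas.reduceLeft(merge)
}
```

    With a left reduction, whichever schema is read first fixes the leading columns, which is why the order of the files is significant.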
    
    One nit: if merging schemas is not enabled but multiple files are given, 
there is no guarantee of the output order, since there might not be a summary 
file for the first file, in which case the columns of the other files end up 
first. 
    
    However, I think this should be okay, since disabling schema merging 
assumes that all the files have the same schema.
    
    In addition, the test code for this only checks the names of the fields.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark SPARK-11500

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/9517.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #9517
    
----
commit b0e6ce2729f584a9f95996707f60eb650c2a58b9
Author: hyukjinkwon <[email protected]>
Date:   2015-11-06T07:38:26Z

    [SPARK-11500][SQL] Not deterministic order of columns when using merging 
schemas.

commit 08fc91ca8d21902677e78f0adb3b36769f2cba51
Author: hyukjinkwon <[email protected]>
Date:   2015-11-06T07:38:55Z

    [SPARK-11500][SQL] Add a test to check the deterministic order.

commit bcf72d3ca308f9a69993803d9c8939696c915b07
Author: hyukjinkwon <[email protected]>
Date:   2015-11-06T07:40:17Z

    [SPARK-11500][SQL] Remove trailing newline.

----


