GitHub user HyukjinKwon opened a pull request:
https://github.com/apache/spark/pull/9517
[SPARK-11500][SQL] Not deterministic order of columns when using merging
schemas.
https://issues.apache.org/jira/browse/SPARK-11500
As filed in SPARK-11500, when schema merging is enabled, the order in which
files are touched can affect the order of the output columns.
This was mostly caused by the use of `Set` and `Map`, so I replaced them with
`LinkedHashSet` and `LinkedHashMap` to preserve insertion order.
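
As a minimal sketch of why the linked collections help (this is not the code
from the patch; `ColumnOrderSketch` and `mergeColumnNames` are hypothetical
names, and plain strings stand in for schema fields), a `LinkedHashSet` yields
its elements in first-insertion order, so the result depends only on the order
in which the schemas are visited:

```scala
import scala.collection.mutable

object ColumnOrderSketch {
  // Hypothetical helper: collects distinct column names in first-seen order.
  // Plain strings stand in for schema fields; this is not Spark's own code.
  def mergeColumnNames(schemas: Seq[Seq[String]]): List[String] = {
    // LinkedHashSet iterates in insertion order, unlike a hash-based Set.
    val seen = mutable.LinkedHashSet.empty[String]
    for (schema <- schemas; column <- schema) seen += column
    seen.toList
  }
}
```

For example, `mergeColumnNames(Seq(Seq("a", "b"), Seq("b", "c")))` visits
`a`, `b`, `b`, `c` and keeps each name at the position where it first appeared.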
Also, the reduction now runs from the left, and the order of `filesToTouch`
was changed from `metadataStatuses ++ commonMetadataStatuses ++ needMerged` to
`needMerged ++ metadataStatuses ++ commonMetadataStatuses` so that the
part-files are touched first; these always have the schema in their footers,
whereas the other files might not exist.
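
The effect of the new ordering can be sketched roughly as follows. This is an
illustration under assumptions, not the patched code: `FileSchema`, `merge`,
and `mergedColumns` are hypothetical, and lists of column names stand in for
the real file statuses and schemas. Only the three sequence names and their
concatenation order come from the description above.

```scala
object TouchOrderSketch {
  // Hypothetical stand-in for a file and the column names in its schema.
  final case class FileSchema(path: String, columns: List[String])

  // Left-biased merge: keep the left columns, then append unseen right ones.
  def merge(left: List[String], right: List[String]): List[String] =
    left ++ right.filterNot(c => left.contains(c))

  def mergedColumns(
      needMerged: Seq[FileSchema],
      metadataStatuses: Seq[FileSchema],
      commonMetadataStatuses: Seq[FileSchema]): List[String] = {
    // Part-files first, so their footers decide the leading columns.
    val filesToTouch = needMerged ++ metadataStatuses ++ commonMetadataStatuses
    // reduceLeft folds from the first file onward; it throws on empty input.
    filesToTouch.map(_.columns).reduceLeft(merge)
  }
}
```

Because the fold starts from the first part-file, a summary file that lists
the same columns in a different order does not change the result.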
One caveat: if schema merging is not enabled and multiple files are given,
the output order is not guaranteed, because there might be no summary file
for the first file, in which case the columns of the other files end up
first.
However, I thought this should be okay, since disabling schema merging
assumes that all the files have the same schema.
In addition, the test for this change only checks the names of the fields.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/HyukjinKwon/spark SPARK-11500
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/9517.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #9517
----
commit b0e6ce2729f584a9f95996707f60eb650c2a58b9
Author: hyukjinkwon <[email protected]>
Date: 2015-11-06T07:38:26Z
[SPARK-11500][SQL] Not deterministic order of columns when using merging
schemas.
commit 08fc91ca8d21902677e78f0adb3b36769f2cba51
Author: hyukjinkwon <[email protected]>
Date: 2015-11-06T07:38:55Z
[SPARK-11500][SQL] Add a test to check the deterministic order.
commit bcf72d3ca308f9a69993803d9c8939696c915b07
Author: hyukjinkwon <[email protected]>
Date: 2015-11-06T07:40:17Z
[SPARK-11500][SQL] Remove trailing newline.
----