[
https://issues.apache.org/jira/browse/SPARK-11500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Apache Spark reassigned SPARK-11500:
------------------------------------
Assignee: Apache Spark (was: Hyukjin Kwon)
> Not deterministic order of columns when using merging schemas.
> --------------------------------------------------------------
>
> Key: SPARK-11500
> URL: https://issues.apache.org/jira/browse/SPARK-11500
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Reporter: Hyukjin Kwon
> Assignee: Apache Spark
>
> When executing
> {{sqlContext.read.option("mergeSchema", "true").parquet(pathOne,
> pathTwo).printSchema()}}
> the order of columns is not deterministic: the same call can print the
> columns in a different order from one run to the next.
> This is caused by {{FileStatusCache}} in {{HadoopFsRelation}} (which
> {{ParquetRelation}} extends). {{FileStatusCache.listLeafFiles()}} returns a
> {{Set[FileStatus]}}, which does not preserve the order of the original
> {{Array[FileStatus]}}.
> After retrieving the list of leaf files, including {{_metadata}} and
> {{_common_metadata}}, {{ParquetRelation.mergeSchemasInParallel()}} merges
> (separately, and only if necessary) the {{Set}}s of {{_metadata}},
> {{_common_metadata}} and part-files. The merged schema therefore starts with
> the columns of whichever file happens to come first, including columns that
> the other files do not have, so the column order varies.
> I think this can be resolved by using {{LinkedHashSet}}.
> As a simple illustration: if file A has fields 1,2,3 and file B has fields
> 3,4,5, we cannot tell which columns will come first, because the order in
> which the files are listed is not deterministic.
> 1. The file list (A and B) is read.
> 2. The order of that list is not deterministic (A then B, or B then A), as
> described above.
> 3. The retrieved schemas are merged with {{reduceOption}} in that order
> (maybe this should also be {{reduceLeftOption}} or {{reduceRightOption}}).
> 4. The output columns are 1,2,3,4,5 for A then B, or 3,4,5,1,2 for B then
> A.
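
The four steps above can be sketched in plain Scala without Spark. This is only an illustration under stated assumptions: {{mergeColumns}} is a hypothetical stand-in for the real schema merge (it keeps the left columns and appends any new right-hand columns), and the object and value names are made up for this example. It shows that the merge result depends on listing order, and that an insertion-ordered set such as {{LinkedHashSet}} keeps that order fixed.

```scala
import scala.collection.mutable

object MergeOrderSketch {
  // Hypothetical stand-in for schema merging (not Spark's implementation):
  // keep the left columns in place, then append right-hand columns
  // that are not already present.
  def mergeColumns(left: Seq[String], right: Seq[String]): Seq[String] =
    left ++ right.filterNot(c => left.contains(c))

  val fileA: Seq[String] = Seq("1", "2", "3")
  val fileB: Seq[String] = Seq("3", "4", "5")

  // The same two schemas merged in the two possible listing orders.
  val aThenB: Seq[String] =
    Seq(fileA, fileB).reduceOption((l, r) => mergeColumns(l, r)).getOrElse(Seq.empty)
  val bThenA: Seq[String] =
    Seq(fileB, fileA).reduceOption((l, r) => mergeColumns(l, r)).getOrElse(Seq.empty)

  // A LinkedHashSet iterates in insertion order, so the listing order
  // (and therefore the merge order) stays the same on every run.
  val ordered: mutable.LinkedHashSet[Seq[String]] =
    mutable.LinkedHashSet(fileA, fileB)
  val viaLinkedHashSet: Seq[String] =
    ordered.toSeq.reduceOption((l, r) => mergeColumns(l, r)).getOrElse(Seq.empty)

  def main(args: Array[String]): Unit = {
    println(aThenB.mkString(","))           // 1,2,3,4,5
    println(bThenA.mkString(","))           // 3,4,5,1,2
    println(viaLinkedHashSet.mkString(",")) // 1,2,3,4,5
  }
}
```

Whether {{LinkedHashSet}} is the right fix inside {{FileStatusCache}} is a separate question; the sketch only shows why a stable listing order yields a stable column order.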