[ https://issues.apache.org/jira/browse/SPARK-11500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-11500:
---------------------------------
    Description: 
When executing

{{sqlContext.read.option("mergeSchema", "true").parquet(pathOne, pathTwo).printSchema()}}

the order of the columns is not deterministic; the merged schema can show up 
with a different column order from run to run.
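
A minimal reproduction sketch (the paths and column names below are made up 
for illustration; any two Parquet directories with overlapping but different 
schemas should do):

{code:scala}
// Hypothetical paths, only for illustration.
val pathOne = "/tmp/parquet/one"
val pathTwo = "/tmp/parquet/two"

// Two sets of part-files with overlapping but different schemas.
sqlContext.range(0, 10).selectExpr("id AS a", "id AS b").write.parquet(pathOne)
sqlContext.range(0, 10).selectExpr("id AS b", "id AS c").write.parquet(pathTwo)

// With schema merging enabled, the merged schema itself is correct,
// but the order of its columns can differ between runs.
sqlContext.read.option("mergeSchema", "true")
  .parquet(pathOne, pathTwo).printSchema()
{code}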

This is caused by {{FileStatusCache}} in {{HadoopFsRelation}} (which 
{{ParquetRelation}} extends). When {{FileStatusCache.listLeafFiles()}} is 
called, it returns a {{Set[FileStatus]}}, which does not preserve insertion 
order, so the resulting {{Array[FileStatus]}} can come out in a different 
order each time.
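
A minimal sketch of the underlying behaviour, independent of Spark: converting 
a hash-based {{Set}} to an {{Array}} does not preserve the order in which 
elements were added, so the order in which the files were listed is lost.

{code:scala}
import scala.collection.mutable

// Stand-ins for leaf files; the real elements are FileStatus objects.
val listed = Seq("part-00000", "part-00001", "_metadata", "_common_metadata")

val statuses = mutable.Set[String]() ++= listed
// The array order depends on hashing, not on the listing order above.
println(statuses.toArray.mkString(", "))
{code}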

So, after retrieving the list of leaf files (including {{_metadata}} and 
{{_common_metadata}}), {{ParquetRelation.mergeSchemasInParallel()}} merges 
(separately and only if necessary) the {{Set}}s of {{_metadata}}, 
{{_common_metadata}} and part-files. Because the merge starts from whichever 
file happens to come first, the columns that only that file has become the 
leading columns of the merged schema, and the overall column order changes 
from run to run.

I think this can be resolved by using a {{LinkedHashSet}}, which preserves 
insertion order.
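
A rough sketch of that direction (this is not the actual Spark code; the 
method below is a simplified stand-in for {{FileStatusCache.listLeafFiles()}}): 
accumulating into a {{LinkedHashSet}} keeps the listing order, so the 
{{Array[FileStatus]}} handed to schema merging is deterministic.

{code:scala}
import scala.collection.mutable
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

// Simplified illustration only; the real logic lives in HadoopFsRelation.
def listLeafFiles(fs: FileSystem, dirs: Seq[Path]): mutable.LinkedHashSet[FileStatus] = {
  val leaves = mutable.LinkedHashSet.empty[FileStatus]
  dirs.foreach { dir =>
    fs.listStatus(dir).foreach { status =>
      if (status.isDirectory) leaves ++= listLeafFiles(fs, Seq(status.getPath))
      else leaves += status
    }
  }
  leaves // leaves.toArray preserves insertion (listing) order
}
{code}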



> Non-deterministic order of columns when merging schemas.
> ---------------------------------------------------------
>
>                 Key: SPARK-11500
>                 URL: https://issues.apache.org/jira/browse/SPARK-11500
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Hyukjin Kwon
>


