[jira] [Comment Edited] (SPARK-11500) Not deterministic order of columns when using merging schemas.

Hyukjin Kwon (JIRA) Wed, 04 Nov 2015 03:04:52 -0800

    [ 
https://issues.apache.org/jira/browse/SPARK-11500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14989325#comment-14989325
 ]


Hyukjin Kwon edited comment on SPARK-11500 at 11/4/15 11:03 AM:
----------------------------------------------------------------

The output is non-deterministic and also incorrect, which was first noticed 
here: https://github.com/apache/spark/pull/9327#discussion_r43846677. 

Also, I think the insertion order should be preserved rather than sorted, 
because all we need is to make this deterministic. That way, the columns of 
the first file passed to {{parquet}} always stay at the front of the output.
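
To illustrate the difference between keeping insertion order and sorting, here is a minimal sketch. It is not Spark's actual schema-merging code: schemas are reduced to plain lists of column names, and the object name, the helper names and the column names are all hypothetical.

```scala
object MergeColumnOrder {
  // Hypothetical column lists read from two Parquet files.
  val firstFile: Seq[String] = Seq("name", "id")
  val secondFile: Seq[String] = Seq("id", "age")

  // Keeps columns in first-seen order: `distinct` retains the first
  // occurrence of each element, so the first file's columns stay in front.
  def mergeKeepingOrder(schemas: Seq[Seq[String]]): Seq[String] =
    schemas.flatten.distinct

  // Deterministic as well, but the first file's columns no longer lead.
  def mergeSorted(schemas: Seq[Seq[String]]): Seq[String] =
    schemas.flatten.distinct.sorted

  def main(args: Array[String]): Unit = {
    println(mergeKeepingOrder(Seq(firstFile, secondFile)).mkString(", "))
    println(mergeSorted(Seq(firstFile, secondFile)).mkString(", "))
  }
}
```

Both variants give the same result on every run, but only the first keeps {{name}} and {{id}} ahead of {{age}}.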

 



> Not deterministic order of columns when using merging schemas.
> --------------------------------------------------------------
>
>                 Key: SPARK-11500
>                 URL: https://issues.apache.org/jira/browse/SPARK-11500
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Hyukjin Kwon
>
> When executing 
> {{sqlContext.read.option("mergeSchema", "true").parquet(pathOne, 
> pathTwo).printSchema()}}
> the order of the columns is not deterministic; they sometimes show up in a 
> different order.
> This is caused by {{FileStatusCache}} in {{HadoopFsRelation}} (which 
> {{ParquetRelation}} extends). When {{FileStatusCache.listLeafFiles()}} is 
> called, it returns a {{Set[FileStatus]}}, which loses the order of the 
> {{Array[FileStatus]}}.
> So, after retrieving the list of leaf files, including {{_metadata}} and 
> {{_common_metadata}}, it merges (separately, and only if necessary) the 
> {{Set}}s of {{_metadata}}, {{_common_metadata}} and part-files in 
> {{ParquetRelation.mergeSchemasInParallel()}}. This can produce a different 
> column order when the first file has leading columns that the other files 
> do not have.
> I think this can be resolved by using {{LinkedHashSet}}.
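
The ordering behaviour described above can be sketched without Spark or Hadoop. This example is an assumption-laden illustration, not the actual {{FileStatusCache}} code: plain strings stand in for {{FileStatus}} entries, and the object name and file names are hypothetical.

```scala
import scala.collection.mutable

object LeafFileOrder {
  // Hypothetical leaf-file names standing in for FileStatus entries.
  val leafFiles: Array[String] = Array(
    "part-r-00001.parquet",
    "part-r-00002.parquet",
    "part-r-00003.parquet",
    "part-r-00004.parquet",
    "part-r-00005.parquet",
    "_metadata",
    "_common_metadata"
  )

  // A plain Set makes no promise about iteration order, so the original
  // array order may be lost here.
  val unordered: Set[String] = leafFiles.toSet

  // A LinkedHashSet iterates in insertion order, so the array order survives.
  val ordered: mutable.LinkedHashSet[String] =
    mutable.LinkedHashSet.empty[String] ++= leafFiles

  def main(args: Array[String]): Unit = {
    println(unordered.mkString(", "))
    println(ordered.mkString(", "))
  }
}
```

Both collections hold the same elements; only the {{LinkedHashSet}} is guaranteed to iterate in the order in which the files were listed.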



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
