[ https://issues.apache.org/jira/browse/SPARK-11500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14989325#comment-14989325 ]
Hyukjin Kwon edited comment on SPARK-11500 at 11/4/15 11:03 AM:
----------------------------------------------------------------

The column order is non-deterministic and the output is also incorrect; this was first found at https://github.com/apache/spark/pull/9327#discussion_r43846677.

Also, (I think) the insertion order should be kept rather than sorted, because what we need is to make this deterministic. That way, the columns of the first file passed to {{parquet}} always stay leading in the output.

> Not deterministic order of columns when using merging schemas.
> --------------------------------------------------------------
>
>                 Key: SPARK-11500
>                 URL: https://issues.apache.org/jira/browse/SPARK-11500
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Hyukjin Kwon
>
> When executing
> {{sqlContext.read.option("mergeSchema", "true").parquet(pathOne, pathTwo).printSchema()}}
> the order of columns is not deterministic; the columns sometimes show up in a different order.
> This is because of {{FileStatusCache}} in {{HadoopFsRelation}} (which {{ParquetRelation}} extends). When {{FileStatusCache.listLeafFiles()}} is called, it returns a {{Set[FileStatus]}}, which loses the order of the underlying {{Array[FileStatus]}}.
> So, after retrieving the list of leaf files, including {{_metadata}} and {{_common_metadata}}, it merges (separately and if necessary) the {{Set}}s of {{_metadata}}, {{_common_metadata}} and part-files in {{ParquetRelation.mergeSchemasInParallel()}}. This ends up with a different column order whenever the first file merged has leading columns that the other files do not have.
> I think this can be resolved by using {{LinkedHashSet}}.
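
To illustrate the proposed fix, here is a minimal sketch rather than the actual Spark code. It assumes plain strings in place of {{FileStatus}}, and {{mergeColumns}} and the file and column names are hypothetical stand-ins for the real schema merging. It shows that a {{LinkedHashSet}} iterates in insertion order, so the first file's columns stay leading, while a plain hash-based {{Set}} makes no such promise.

```scala
import scala.collection.mutable

object ColumnOrderSketch {
  // Hypothetical stand-in for schema merging: append each file's columns,
  // skipping ones already seen, so columns of earlier files stay leading.
  def mergeColumns(files: Iterable[(String, Seq[String])]): List[String] = {
    val merged = mutable.LinkedHashSet.empty[String]
    files.foreach { case (_, columns) => merged ++= columns }
    merged.toList
  }

  // Hypothetical leaf files in listing order, each with its column names.
  val listed: Seq[(String, Seq[String])] = Seq(
    "part-1" -> Seq("a", "b"),
    "part-2" -> Seq("b", "c"),
    "part-3" -> Seq("c", "d"),
    "part-4" -> Seq("d", "e"),
    "part-5" -> Seq("e", "f")
  )

  def main(args: Array[String]): Unit = {
    // A hash-based Set does not promise to iterate in insertion order,
    // so the merged column order depends on how the entries hash.
    val unordered: Set[(String, Seq[String])] = listed.toSet

    // A LinkedHashSet iterates in insertion order, so "part-1" is merged first.
    val ordered = mutable.LinkedHashSet.empty[(String, Seq[String])] ++= listed

    println(s"Set:           ${mergeColumns(unordered)}")
    println(s"LinkedHashSet: ${mergeColumns(ordered)}")
  }
}
```

Five entries are used on purpose: Scala's immutable {{Set}} happens to keep insertion order for up to four elements, which would hide the problem in a smaller example.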