[GitHub] spark pull request: [SPARK-11500][SQL] Not deterministic order of ...

HyukjinKwon Sun, 08 Nov 2015 23:35:13 -0800

Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/9517#discussion_r44247117

--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRelation.scala
---
@@ -461,13 +461,29 @@ private[sql] class ParquetRelation(
// You should enable this configuration only if you are very sure that for the parquet
// part-files to read there are corresponding summary files containing correct schema.

+ // As filed in SPARK-11500, the order of files to touch is a matter, which might affect
+ // the ordering of the output columns. There are several things to mention here.
+ //
+ // 1. If mergeRespectSummaries config is false, then it merges schemas by reducing from
+ //    the first part-file so that the columns of the first file show first.
+ //
+ // 2. If mergeRespectSummaries config is true, then there should be, at least,
+ //    "_metadata"s for all given files. So, we can ensure the columns of the first file
+ //    show first.
+ //
+ // 3. If shouldMergeSchemas is false, but when multiple files are given, there is
+ //    no guarantee of the output order, since there might not be a summary file for the
+ //    first file, which ends up putting ahead the columns of the other files. However,
+ //    this should be okay since not enabling shouldMergeSchemas means (assumes) all the
+ //    files have the same schemas.
+
val needMerged: Seq[FileStatus] =
if (mergeRespectSummaries) {
Seq()
} else {
dataStatuses
}
- (metadataStatuses ++ commonMetadataStatuses ++ needMerged).toSeq
+ needMerged ++ metadataStatuses ++ commonMetadataStatuses
--- End diff --

Yes, I think I should sort them.
Using the listing as it is does not seem to be recommended: although the files
look sorted, the order does not appear to be guaranteed, judging from [this
link](http://lucene.472066.n3.nabble.com/FileSystem-contract-of-listStatus-td3475540.html).
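
A minimal sketch of the idea, not the code in this pull request: sort the listed files by path before merging, so the merged column order no longer depends on the order in which the file system returned them. `FileEntry`, `sortedByPath`, and `mergeColumns` are hypothetical names used only for illustration; `FileEntry` stands in for Hadoop's `FileStatus`, and schemas are reduced to plain column-name lists.

```scala
// Hypothetical stand-in for org.apache.hadoop.fs.FileStatus; only the path matters here.
final case class FileEntry(path: String)

object DeterministicMerge {
  // Sort by path so the result does not depend on the order of the file listing.
  def sortedByPath(files: Seq[FileEntry]): Seq[FileEntry] =
    files.sortBy(_.path)

  // Reduce column lists from left to right: columns of earlier files come first,
  // and columns seen for the first time in later files are appended.
  def mergeColumns(schemas: Seq[Seq[String]]): Seq[String] =
    schemas.foldLeft(Seq.empty[String]) { (merged, next) =>
      merged ++ next.filterNot(c => merged.contains(c))
    }

  def main(args: Array[String]): Unit = {
    // Assumed example data: two part-files listed in an arbitrary order.
    val columnsByFile = Map(
      "part-r-00002.parquet" -> Seq("b", "c"),
      "part-r-00001.parquet" -> Seq("a", "b")
    )
    val listing = Seq(FileEntry("part-r-00002.parquet"), FileEntry("part-r-00001.parquet"))

    val ordered = sortedByPath(listing)
    println(mergeColumns(ordered.map(f => columnsByFile(f.path))))
  }
}
```

Because the merge is a left-to-right reduction, whichever file comes first decides which columns lead, which is why a fixed sort order matters.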



