Github user liancheng commented on a diff in the pull request:
https://github.com/apache/spark/pull/12179#discussion_r60094164
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -288,15 +288,25 @@ case class DataSource(
         val fileCatalog: FileCatalog =
           new HDFSFileCatalog(sqlContext, options, globbedPaths, partitionSchema)
-        val dataSchema = userSpecifiedSchema.orElse {
+
+        val dataSchema = userSpecifiedSchema.map { schema =>
+          val equality =
+            if (sqlContext.conf.caseSensitiveAnalysis) {
+              org.apache.spark.sql.catalyst.analysis.caseSensitiveResolution
+            } else {
+              org.apache.spark.sql.catalyst.analysis.caseInsensitiveResolution
+            }
+
+          StructType(schema.filterNot(f => partitionColumns.exists(equality(_, f.name))))
+        }.orElse {
--- End diff ---
@yhuai Answering the question we raised offline here. Without this fix, the
following test case added back in this PR fails:
> SimpleTextHadoopFsRelationSuite.SPARK-7616: adjust column name order
accordingly when saving partitioned table
The major contradiction here is that the result of `FileFormat.inferSchema()`
is a schema consisting of all columns that live in the physical data files,
which may include a subset of the partition columns. On the other hand, the
user-specified schema passed via `DataFrameReader.schema()` refers to the full
schema of the table, including all the partition columns. For `FileFormat` data
sources whose `inferSchema()` returns `None`, we have no way of knowing whether
the physical files contain partition columns or not.
To fix the regression failure, here we chop off all partition columns from the
user-specified schema. But this imposes a restriction on `FileFormat` data
sources without schema inference ability:
> No partition columns are allowed in physical files.
This doesn't cause much trouble for Spark's built-in `FileFormat` data sources,
since all of them either have a fixed schema (LibSVM and text) or are able to
infer their own schema (Parquet, ORC, JSON, and CSV). I've checked that this
restriction also exists in branch-1.6, but I'd say it is more by accident than
by design.
An alternative design is to alter the semantics of the user-specified schema
set via `DataFrameReader.schema()` and make it represent the schema of the
physical files. In this way we could solve the problem unambiguously, but that
would apparently break the runtime behavior of existing user code. So it seems
that living with this restriction is the more reasonable choice for now?
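To make the pruning logic concrete, here is a minimal standalone sketch of what the patch does: drop any field from the user-specified schema whose name matches a partition column, using a case-sensitive or case-insensitive resolver depending on the analysis setting. `Field`, `SchemaPruning`, and `filterPartitionColumns` are hypothetical stand-ins for Spark's `StructType` machinery, not actual Spark code:

```scala
// Hypothetical stand-in for StructField; not Spark's actual type.
case class Field(name: String)

object SchemaPruning {
  // Mimics the fix: chop partition columns off the user-specified schema,
  // honoring case sensitivity the way caseSensitiveResolution /
  // caseInsensitiveResolution do in Catalyst.
  def filterPartitionColumns(
      schema: Seq[Field],
      partitionColumns: Seq[String],
      caseSensitive: Boolean): Seq[Field] = {
    // Pick the name-equality resolver based on the analysis setting.
    val equality: (String, String) => Boolean =
      if (caseSensitive) _ == _
      else (a, b) => a.equalsIgnoreCase(b)
    // Keep only fields that are not partition columns.
    schema.filterNot(f => partitionColumns.exists(equality(_, f.name)))
  }
}
```

With case-insensitive analysis, a field `P` is pruned when `p` is a partition column; with case-sensitive analysis it survives, which is exactly why the resolver choice matters here.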