Github user gatorsmile commented on a diff in the pull request:
https://github.com/apache/spark/pull/19579#discussion_r147264509
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala ---
@@ -85,14 +86,28 @@ case class CreateDataSourceTableCommand(table: CatalogTable, ignoreIfExists: Boo
}
}
-    val newTable = table.copy(
-      schema = dataSource.schema,
-      partitionColumnNames = partitionColumnNames,
-      // If metastore partition management for file source tables is enabled, we start off with
-      // partition provider hive, but no partitions in the metastore. The user has to call
-      // `msck repair table` to populate the table partitions.
-      tracksPartitionsInCatalog = partitionColumnNames.nonEmpty &&
-        sessionState.conf.manageFilesourcePartitions)
+    val newTable = dataSource match {
+      // Since Spark 2.1, we store the inferred schema of data source in metastore, to avoid
+      // inferring the schema again at read path. However if the data source has overlapped columns
+      // between data and partition schema, we can't store it in metastore as it breaks the
+      // assumption of table schema. Here we fallback to the behavior of Spark prior to 2.1, store
+      // empty schema in metastore and infer it at runtime. Note that this also means the new
+      // scalable partitioning handling feature(introduced at Spark 2.1) is disabled in this case.
+      case r: HadoopFsRelation if r.overlappedPartCols.nonEmpty =>
+        table.copy(schema = new StructType(), partitionColumnNames = Nil)
--- End diff --
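For context on the case handled above, a minimal sketch of how such an overlap can arise (the path and table name are hypothetical, and an active `SparkSession` named `spark` is assumed): data files that themselves contain a column `p`, laid out under a partition directory also named `p`.

```scala
import spark.implicits._

// The Parquet files contain a column `p`, and the `p=1` directory makes `p`
// a partition column as well, so the inferred data schema and the discovered
// partition schema overlap.
Seq((0, 0)).toDF("i", "p").write.parquet("/tmp/t/p=1")

// Creating the table without an explicit schema triggers schema inference,
// which hits the overlap and takes the empty-schema fallback above.
spark.sql("CREATE TABLE t USING parquet LOCATION '/tmp/t'")
```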
Log a warning message here? When data columns and partition columns have
the same names, the values could be inconsistent. Thus, we do not suggest that users
create such a table, and it might not perform well because we infer the schema at
runtime.
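For concreteness, a sketch of what such a warning could look like, placed just before the fallback (the message wording is only a suggestion, and it assumes `overlappedPartCols` is keyed by column name and that `logWarning` from Spark's `Logging` trait is in scope in this command):

```scala
case r: HadoopFsRelation if r.overlappedPartCols.nonEmpty =>
  // Hypothetical warning; the exact text is up to the author.
  logWarning("Found overlapping columns between the data schema and the partition " +
    s"schema: ${r.overlappedPartCols.keys.mkString(", ")}. The values of these " +
    "columns may be inconsistent, and the table schema will be inferred at " +
    "runtime, which can hurt read performance.")
  table.copy(schema = new StructType(), partitionColumnNames = Nil)
```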