Github user gatorsmile commented on a diff in the pull request:
https://github.com/apache/spark/pull/19579#discussion_r147264509
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala ---
@@ -85,14 +86,28 @@ case class CreateDataSourceTableCommand(table: CatalogTable, ignoreIfExists: Boo
}
}
-    val newTable = table.copy(
-      schema = dataSource.schema,
-      partitionColumnNames = partitionColumnNames,
-      // If metastore partition management for file source tables is enabled, we start off with
-      // partition provider hive, but no partitions in the metastore. The user has to call
-      // `msck repair table` to populate the table partitions.
-      tracksPartitionsInCatalog = partitionColumnNames.nonEmpty &&
-        sessionState.conf.manageFilesourcePartitions)
+    val newTable = dataSource match {
+      // Since Spark 2.1, we store the inferred schema of data source in metastore, to avoid
+      // inferring the schema again at read path. However if the data source has overlapped columns
+      // between data and partition schema, we can't store it in metastore as it breaks the
+      // assumption of table schema. Here we fallback to the behavior of Spark prior to 2.1, store
+      // empty schema in metastore and infer it at runtime. Note that this also means the new
+      // scalable partitioning handling feature(introduced at Spark 2.1) is disabled in this case.
+      case r: HadoopFsRelation if r.overlappedPartCols.nonEmpty =>
+        table.copy(schema = new StructType(), partitionColumnNames = Nil)
--- End diff --
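For context on the case handled above, a minimal sketch of how such an overlap can arise (the path and table name are hypothetical, and an active `SparkSession` named `spark` is assumed): data files that themselves contain a column `p`, laid out under a partition directory also named `p`.

```scala
import spark.implicits._

// The Parquet files contain a column `p`, and the `p=1` directory makes `p`
// a partition column as well, so the inferred data schema and the discovered
// partition schema overlap.
Seq((0, 0)).toDF("i", "p").write.parquet("/tmp/t/p=1")

// Creating the table without an explicit schema triggers schema inference,
// which hits the overlap and takes the empty-schema fallback above.
spark.sql("CREATE TABLE t USING parquet LOCATION '/tmp/t'")
```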
Log a warning message here? When data columns and partition columns have
the same names, the values could be inconsistent. Thus, we do not suggest that users
create such a table, and it might not perform well because we infer the schema at
runtime.
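For concreteness, a sketch of what such a warning could look like, placed just before the fallback (the message wording is only a suggestion, and it assumes `overlappedPartCols` is keyed by column name and that `logWarning` from Spark's `Logging` trait is in scope in this command):

```scala
case r: HadoopFsRelation if r.overlappedPartCols.nonEmpty =>
  // Hypothetical warning; the exact text is up to the author.
  logWarning("Found overlapping columns between the data schema and the partition " +
    s"schema: ${r.overlappedPartCols.keys.mkString(", ")}. The values of these " +
    "columns may be inconsistent, and the table schema will be inferred at " +
    "runtime, which can hurt read performance.")
  table.copy(schema = new StructType(), partitionColumnNames = Nil)
```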