Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12179#discussion_r60094164
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
    @@ -288,15 +288,25 @@ case class DataSource(
     
             val fileCatalog: FileCatalog =
              new HDFSFileCatalog(sqlContext, options, globbedPaths, partitionSchema)
    -        val dataSchema = userSpecifiedSchema.orElse {
    +
    +        val dataSchema = userSpecifiedSchema.map { schema =>
    +          val equality =
    +            if (sqlContext.conf.caseSensitiveAnalysis) {
    +              org.apache.spark.sql.catalyst.analysis.caseSensitiveResolution
    +            } else {
    +              org.apache.spark.sql.catalyst.analysis.caseInsensitiveResolution
    +            }
    +
    +          StructType(schema.filterNot(f => partitionColumns.exists(equality(_, f.name))))
    +        }.orElse {
    --- End diff ---
    
    @yhuai Answering the question we raised offline here. Without this fix, the 
following test case added back in this PR fails:
    
    > SimpleTextHadoopFsRelationSuite.SPARK-7616: adjust column name order 
accordingly when saving partitioned table
    
    The major contradiction here is that the result of `FileFormat.inferSchema()` is a schema consisting of the columns that live in the physical data files, which may contain a subset of the partition columns. On the other hand, the user-specified schema passed via `DataFrameReader.schema()` refers to the full schema of the table, including all the partition columns. For `FileFormat` data sources whose `inferSchema()` returns `None`, we have no idea whether the physical files contain partition columns or not.
    
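    For illustration, here is a minimal sketch of the mismatch. The paths and column names (`a`, `b`, `year`) are hypothetical, made up just for this example:

    ```scala
    // Hypothetical partitioned layout: the partition column `year` appears only in
    // directory names, not inside the physical files:
    //
    //   /tmp/t/year=2015/part-00000   (rows contain only `a` and `b`)
    //   /tmp/t/year=2016/part-00000
    import org.apache.spark.sql.types._

    // The user-specified schema describes the whole table, so it includes `year`.
    val fullSchema = StructType(Seq(
      StructField("a", IntegerType),
      StructField("b", StringType),
      StructField("year", IntegerType)))

    // A format with schema inference would report only `a` and `b` from the files.
    // For a format whose `inferSchema()` returns `None`, we cannot tell which of
    // the three columns are actually stored in the files.
    val df = sqlContext.read.schema(fullSchema).format("parquet").load("/tmp/t")
    ```
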
    To fix the regression, here we chop off all partition columns from the user-specified schema. But this imposes a restriction on `FileFormat` data sources without schema inference ability:
    
    > No partition columns are allowed in physical files.
    
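    Standalone, the chopping amounts to something like the following sketch, using the same `caseInsensitiveResolution` helper referenced in the diff (column names are hypothetical):

    ```scala
    import org.apache.spark.sql.catalyst.analysis.caseInsensitiveResolution
    import org.apache.spark.sql.types._

    val userSpecifiedSchema = StructType(Seq(
      StructField("a", IntegerType),
      StructField("b", StringType),
      StructField("year", IntegerType)))

    val partitionColumns = Seq("year")

    // Drop every field that matches a partition column; what remains is treated
    // as the data schema of the physical files.
    val dataSchema = StructType(userSpecifiedSchema.filterNot { f =>
      partitionColumns.exists(caseInsensitiveResolution(_, f.name))
    })
    // dataSchema now contains only `a` and `b`.
    ```
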
    This doesn't cause much trouble for Spark's built-in `FileFormat` data sources, since all of them either have a fixed schema (LibSVM and text) or are able to infer their own schema (Parquet, ORC, JSON, and CSV). I've checked that this restriction also exists in branch-1.6, but I'd say it is there more by accident than by design.
    
    An alternative design is to alter the semantics of the user-specified schema set via `DataFrameReader.schema()` so that it represents the schema of the physical files. That would solve the problem unambiguously, but it may break the runtime behavior of existing user code. So it seems that living with the restriction is the more reasonable choice for now?


