Github user liancheng commented on the issue:

    https://github.com/apache/spark/pull/16030
  
    @brkyvz @maropu Actually, we do allow users to create partitioned tables 
whose data schema contains (part of) the partition columns, and there 
are [test][1] [cases][2] for this use case.
    
    This use case is mostly useful when you are reorganizing an 
existing dataset into a partitioned form. Say you have a JSON dataset 
containing all the tweets in 2016 and you'd like to partition it by date. 
Because the data schema is allowed to contain partition columns, you may 
simply put the JSON files of each date into their own directory. Otherwise, 
you'd have to run an ETL job to strip the date column from the dataset, 
which can be time-consuming.
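    For the tweets example, the layout might look like this (a hypothetical 
directory structure, just for illustration) -- note that each JSON file 
still carries its own `date` field:

```
tweets/
  date=2016-01-01/
    part-0.json    <- records inside still contain the "date" column
  date=2016-01-02/
    part-0.json
```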
    
    As for the query @maropu mentioned in the PR description, the query itself 
is problematic, because it lacks a user-specified schema to override the data 
type of the partition column `a`. Ideally, partition discovery would fill in 
the correct data type `LongType`, but that's impossible since the directory 
path doesn't expose that information. That's why a user-specified schema is 
necessary.
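    To see why discovery can't do better, here is a toy sketch in plain Python 
(not Spark's actual inference code, which lives elsewhere): the path segment 
`a=1` only exposes the string `"1"`, so discovery can at best pick the 
narrowest type that parses it, and has no way to know the data files declare 
`a` as `LongType`.

```python
def infer_partition_type(raw: str) -> str:
    """Toy model of partition value inference: the directory path
    only exposes a string, so we can merely guess a type from it."""
    try:
        int(raw)
        return "IntegerType"  # "1" parses as an int; LongType is indistinguishable
    except ValueError:
        pass
    try:
        float(raw)
        return "DoubleType"
    except ValueError:
        return "StringType"

# The path segment "a=1" exposes only the string "1":
print(infer_partition_type("1"))           # IntegerType, even if the data says LongType
print(infer_partition_type("2016-01-01"))  # StringType
```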
    
    This query works in 2.0.2 only because of the bug @brkyvz fixed: Spark 
used to ignore the data types of partition columns specified in the 
user-specified schema. Now that the bug is fixed, the query no longer works 
as the user expected.
    
    In short:
    
    1. This isn't a regression; the original query itself is problematic.
    2. For this PR, we can either just close it or try to provide a better 
error message in the read path (ask the user to provide a user-specified 
schema) when:
    
       - A partition column `p` also appears in the data schema, and
       - The discovered data type of `p` is different from the data type 
specified in the data schema.
    
       An alternative is to override the discovered partition column data type 
with the one in the data schema, if specified. But I'd say this change is 
probably too risky for 2.1 at this point.
    
    [1]: 
https://github.com/apache/spark/blob/c51c7725944d60738e2bac3e11f6aea74812905c/sql/hive/src/test/scala/org/apache/spark/sql/sources/ParquetHadoopFsRelationSuite.scala#L44-L66
    [2]: 
https://github.com/apache/spark/blob/c51c7725944d60738e2bac3e11f6aea74812905c/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcHadoopFsRelationSuite.scala#L43-L66

