Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/16030
  
    After an offline discussion with @liancheng, here is the result:
    
    **Why does the test fail?**
    1. We write a parquet file with schema `[a: long, b: int]` to path 
`/data/a=1`.
    2. When reading it back, we infer the data schema as `[a: long, b: int]` and the partition schema as `[a: int]` (see the repro sketch after this list).
    3. In [`HadoopFsRelation.schema`](https://github.com/apache/spark/blob/branch-2.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/HadoopFsRelation.scala#L51-L56), we merge the data schema and the partition schema, and announce to users that the output schema will be `[a: long, b: int]`.
    4. In [`FileSourceScanExec`](https://github.com/apache/spark/blob/branch-2.1/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L252-L260), we build a parquet reader and tell it that the data schema is `[a: long, b: int]`, the required schema is `[b: int]`, and the partition schema is `[a: int]`.
    5. In [vectorized parquet read](https://github.com/apache/spark/blob/branch-2.1/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java#L181-L188), we read the data from parquet files according to the required schema `[b: int]`, and append partition values according to the partition schema `[a: int]`, so the schema of the physical row data is `[b: int, a: int]`.
    6. In [`FileSourceStrategy`](https://github.com/apache/spark/blob/branch-2.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala#L74-L76), we mistakenly think the parquet scan will output data with schema `[b: int, a: long]`, and read the second column as long type. The vectorized parquet reader can NOT read an int column as long and throws an NPE.
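
    As a concrete illustration of steps 1 and 2, here is a minimal repro sketch (assuming a local `SparkSession`; the `/tmp/data` path and variable names are illustrative, not taken from the actual test):

    ```scala
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    // Step 1: the written data schema is [a: long, b: int]; the directory name
    // `a=1` makes `a` a discovered partition column when the parent path is read.
    Seq((1L, 10)).toDF("a", "b")
      .write.mode("overwrite").parquet("/tmp/data/a=1")

    // Step 2: partition discovery infers the partition schema as [a: int],
    // while the parquet footer still carries the data schema [a: long, b: int].
    val df = spark.read.parquet("/tmp/data")
    df.printSchema()
    df.select("b", "a").show()   // on branch-2.1 this can hit the NPE described in step 6
    ```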
    
    **How to fix?**
    The root cause is the question: when the data schema includes partition columns, how do we determine the type of those partition columns? Currently, at the physical layer (the reader) we trust the partition schema, which is inferred from directory strings, while at the logical layer (`HadoopFsRelation`) we trust the partition columns inside the data schema. This inconsistency causes the bug.
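
    To make the inconsistency concrete (continuing the hypothetical repro sketch above):

    ```scala
    // The logical and physical layers disagree about the type of partition column `a`.
    df.schema("a").dataType   // LongType: HadoopFsRelation trusts the data schema
    // ...but the reader appends the partition value for `a` as an int, because the
    // physical layer trusts the partition schema inferred from the directory name.
    ```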
    
    Given that we use partition values extracted from directory strings and ignore the partition columns inside the physical data files, we think it's more reasonable to trust the partition schema.
    
    So the fix is simple: update `HadoopFsRelation.schema` to respect the positions of partition columns in the data schema, but take the partition column types from the partition schema. A sketch of the idea is below.
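
    A minimal sketch of that merge, assuming the `dataSchema` and `partitionSchema` fields that `HadoopFsRelation` already exposes (an illustration of the idea, not the actual patch; it ignores case-insensitive column resolution):

    ```scala
    import org.apache.spark.sql.types.StructType

    // Keep each column at its position in the data schema, but for columns that
    // are also partition columns, use the field (and thus the type) from the
    // partition schema. Partition columns missing from the data schema are
    // appended at the end, as before.
    def mergedSchema(dataSchema: StructType, partitionSchema: StructType): StructType = {
      val partitionFields = partitionSchema.map(f => f.name -> f).toMap
      val overridden = dataSchema.map(f => partitionFields.getOrElse(f.name, f))
      val appended = partitionSchema.filterNot(f => dataSchema.fieldNames.contains(f.name))
      StructType(overridden ++ appended)
    }
    ```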


