gengliangwang opened a new pull request #24284: [SPARK-27356][SQL] File source 
V2: Fix the case that data columns overlap with partition schema
URL: https://github.com/apache/spark/pull/24284
 
 
   ## What changes were proposed in this pull request?
   
   The method `Scan.readSchema` returns the actual schema of this data source 
scan. 
   In the current file source V2 framework, the schema is not returned 
correctly if `dataSchema` and `partitionSchema` have overlapping columns. 
The actual schema should be 
   `dataSchema - overlapSchema + partitionSchema`, which differs from 
the pushed-down `requiredSchema`. (The pushed-down `requiredSchema` may have 
a different column order in this case; see 
`PartitioningUtils.mergeDataAndPartitionSchema`.)
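
As a rough illustration of the `dataSchema - overlapSchema + partitionSchema` formula, here is a minimal sketch that models a schema as a plain list of column names. The function name `file_table_schema` is hypothetical and is not part of Spark; the real code works on `StructType` and also has to handle case sensitivity, which this sketch ignores.

```python
def file_table_schema(data_schema, partition_schema):
    """Hypothetical helper: dataSchema - overlapSchema + partitionSchema.

    Schemas are modelled as lists of column names (an assumption made
    for illustration; Spark uses StructType with typed fields).
    """
    partition_names = set(partition_schema)
    # Drop every data column that also appears in the partition schema.
    non_overlapping = [col for col in data_schema if col not in partition_names]
    # Append all partition columns at the end, in their original order.
    return non_overlapping + list(partition_schema)


if __name__ == "__main__":
    print(file_table_schema(["a", "b", "c"], ["b", "d"]))
```

With data columns `[a, b, c]` and partition columns `[b, d]`, the overlapping column `b` is taken from the partition schema, so it moves to the end together with `d`.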
   
   This PR:
   1. Bug fix: fixes the corner case where `dataSchema` overlaps with 
`partitionSchema`.
   2. Improvement: prunes partition column values when only some of the 
partition columns are required.
   3. Behavior change: to keep it simple, the schema of `FileTable` is 
`dataSchema - overlapSchema + partitionSchema`, instead of interleaving the 
data schema and the partition schema (see 
`PartitioningUtils.mergeDataAndPartitionSchema`).
   For example, if the data schema is [a, b, c] and the partition schema is [b, d]:
   In V1, the schema of `HadoopFsRelation` is [a, b, c, d].
   In file source V2, the schema of `FileTable` is [a, c, b, d].
   Putting all the partition columns at the end of the table schema is more 
reasonable. Also, for a `select *` query with no schema pruning, the schemas 
of `FileTable` and `FileScan` still match.
   
   ## How was this patch tested?
   
   Unit test.
   

