gengliangwang opened a new pull request #24284: [SPARK-27356][SQL] File source V2: Fix the case that data columns overlap with partition schema
URL: https://github.com/apache/spark/pull/24284

## What changes were proposed in this pull request?

The method `Scan.readSchema` returns the actual schema of a data source scan. In the current file source V2 framework, the schema is not returned correctly when columns overlap between `dataSchema` and `partitionSchema`. The actual schema should be `dataSchema - overlapSchema + partitionSchema`, which is different from the pushed-down `requiredSchema`. (The pushed-down `requiredSchema` may have a different column order in that case; see `PartitioningUtils.mergeDataAndPartitionSchema`.)

This PR is to:
1. Bug fix: fix the corner case in which `dataSchema` overlaps with `partitionSchema`.
2. Improvement: prune partition column values when some of the partition columns are not required.
3. Behavior change: to keep it simple, the schema of `FileTable` is now `dataSchema - overlapSchema + partitionSchema`, instead of mixing the data schema and partition schema (see `PartitioningUtils.mergeDataAndPartitionSchema`).

For example, if the data schema is [a, b, c] and the partition schema is [b, d]:
- In V1, the schema of `HadoopFsRelation` is [a, b, c, d].
- In file source V2, the schema of `FileTable` is [a, c, b, d].

Putting all the partition columns at the end of the table schema is more reasonable. Also, for a `select *` operation with no schema pruning, the schemas of `FileTable` and `FileScan` still match.

## How was this patch tested?

Unit test.
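The column ordering described above (`dataSchema - overlapSchema + partitionSchema`, with all partition columns moved to the end) can be sketched in a few lines. This is an illustrative, language-agnostic sketch using plain column-name lists, not Spark's actual `StructType` implementation; the function name is hypothetical.

```python
def merged_table_schema(data_schema, partition_schema):
    """Drop data columns that also appear in the partition schema,
    then append the full partition schema at the end."""
    overlap = set(partition_schema)
    pruned_data = [col for col in data_schema if col not in overlap]
    return pruned_data + partition_schema

# The example from the description: data schema [a, b, c], partition schema [b, d].
data_schema = ["a", "b", "c"]
partition_schema = ["b", "d"]

print(merged_table_schema(data_schema, partition_schema))  # ['a', 'c', 'b', 'd']
```

Note how the overlapping column `b` keeps its position from the partition schema rather than from the data schema, which is what makes the V2 `FileTable` schema [a, c, b, d] instead of V1's [a, b, c, d].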
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

With regards,
Apache Git Services
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
