Github user liancheng commented on the issue:
https://github.com/apache/spark/pull/16030
Another thing that @cloud-fan and I agreed upon is that our current
`DataFrameReader.schema()` interface method is insufficient. We may want to add
- `DataFrameReader.dataSchema()`, and
- `DataFrameReader.partitionSchema()`
in the future to capture cases where the data schema and the partition schema
overlap with each other.
The reason is that `FileFormat.buildReader()` requires a `dataSchema`
argument, which should be the full schema of the underlying data files and may
therefore contain one or more partition columns. This is because many data
sources (e.g. CSV and LibSVM) simply don't support column pruning: you have to
read out all physical columns first and then project away the unnecessary ones.
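To make the "read everything, then project" behavior concrete, here is a minimal, Spark-free sketch in Python (the data and function names are illustrative, not Spark API):

```python
import csv
import io

# Toy stand-in for a data source without column pruning (like CSV):
# every physical column must be read, then unwanted ones projected away.
RAW = "x,y,p1\n1,a,2017\n2,b,2018\n"

def read_then_project(raw, wanted):
    """Read ALL columns from the CSV, then keep only the `wanted` ones."""
    rows = list(csv.DictReader(io.StringIO(raw)))   # full physical read
    return [{col: row[col] for col in wanted} for row in rows]

print(read_then_project(RAW, ["x", "y"]))
```

Even though only `x` and `y` were requested, every row was fully parsed first; the pruning happens only in the projection step.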
Say we have a table with the following schema information:
- Data schema: `[x, y, p1]`
- Partition schema: `[p1, p2]`
Say a user tries to read this table with a user-specified schema `[x, y,
p1, p2]`. We then have no way to tell whether `p1` is part of the data schema,
since schema inference is skipped when a user-specified schema is provided.
Although I haven't tested it extensively, I'd expect this to be problematic
whenever a user-specified schema is combined with a data source that doesn't
support column pruning (e.g. CSV) and the data schema overlaps with the
partition schema.
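The ambiguity can be sketched in a few lines of plain Python (the `table_schema` helper is hypothetical, just modeling how data and partition columns merge into the user-facing schema):

```python
# Two tables whose data schemas differ present the SAME user-facing
# schema, so a user-specified schema alone cannot tell us which columns
# actually live inside the data files.

def table_schema(data_schema, partition_schema):
    """User-facing schema: data columns first, then partition-only columns."""
    return data_schema + [c for c in partition_schema if c not in data_schema]

# Table A: p1 is physically stored inside the data files.
schema_a = table_schema(["x", "y", "p1"], ["p1", "p2"])
# Table B: p1 exists only as a partition directory column.
schema_b = table_schema(["x", "y"], ["p1", "p2"])

assert schema_a == schema_b == ["x", "y", "p1", "p2"]
# Given only the user-specified schema [x, y, p1, p2], the two layouts
# are indistinguishable once schema inference is skipped.
```

This is why separate `dataSchema()` / `partitionSchema()` entry points would remove the guesswork: the caller states explicitly which side `p1` belongs to.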