Github user liancheng commented on the issue:
https://github.com/apache/spark/pull/16030
Another thing that @cloud-fan and I agreed upon is that our current
`DataFrameReader.schema()` interface method is insufficient. We may want to add
- `DataFrameReader.dataSchema()`, and
- `DataFrameReader.partitionSchema()`
in the future to capture cases where the data schema and the partition schema
overlap with each other.
The reason is that `FileFormat.buildReader()` requires a `dataSchema`
argument, which should be the full schema of the underlying data files and may
therefore contain one or more partition columns. This is because many data
sources (e.g. CSV and LibSVM) simply don't support column pruning: you have to
read out all physical columns first and then project away the unnecessary ones.
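To make the "read everything, then project" behavior concrete, here is a minimal, Spark-free sketch in Python (the data and function names are illustrative, not Spark API):

```python
import csv
import io

# Toy stand-in for a data source without column pruning (like CSV):
# every physical column must be read, then unwanted ones projected away.
RAW = "x,y,p1\n1,a,2017\n2,b,2018\n"

def read_then_project(raw, wanted):
    """Read ALL columns from the CSV, then keep only the `wanted` ones."""
    rows = list(csv.DictReader(io.StringIO(raw)))   # full physical read
    return [{col: row[col] for col in wanted} for row in rows]

print(read_then_project(RAW, ["x", "y"]))
```

Even though only `x` and `y` were requested, every row was fully parsed first; the pruning happens only in the projection step.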
Say we have a table with the following schema information:
- Data schema: `[x, y, p1]`
- Partition schema: `[p1, p2]`
Say a user tries to read this table with a user-specified schema `[x, y,
p1, p2]`. We then have no way to tell whether `p1` is part of the data schema,
since schema inference is skipped when a user-specified schema is provided.
Although I haven't tested it extensively, I'd expect this to be problematic
whenever a user-specified schema is combined with a data source that doesn't
support column pruning (e.g. CSV) and the data schema overlaps with the
partition schema.
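The ambiguity can be sketched in a few lines of plain Python (the `table_schema` helper is hypothetical, just modeling how data and partition columns merge into the user-facing schema):

```python
# Two tables whose data schemas differ present the SAME user-facing
# schema, so a user-specified schema alone cannot tell us which columns
# actually live inside the data files.

def table_schema(data_schema, partition_schema):
    """User-facing schema: data columns first, then partition-only columns."""
    return data_schema + [c for c in partition_schema if c not in data_schema]

# Table A: p1 is physically stored inside the data files.
schema_a = table_schema(["x", "y", "p1"], ["p1", "p2"])
# Table B: p1 exists only as a partition directory column.
schema_b = table_schema(["x", "y"], ["p1", "p2"])

assert schema_a == schema_b == ["x", "y", "p1", "p2"]
# Given only the user-specified schema [x, y, p1, p2], the two layouts
# are indistinguishable once schema inference is skipped.
```

This is why separate `dataSchema()` / `partitionSchema()` entry points would remove the guesswork: the caller states explicitly which side `p1` belongs to.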