[GitHub] spark issue #21320: [SPARK-4502][SQL] Parquet nested column pruning - founda...

mallman Fri, 20 Jul 2018 18:31:22 -0700

Github user mallman commented on the issue:

    https://github.com/apache/spark/pull/21320
  
    > Could we move the changes made in ParquetReadSupport.scala to a separate 
PR? Then, we can merge this PR very quickly.
    
    If I remove the changes to `ParquetReadSupport.scala`, then four tests fail 
in `ParquetSchemaPruningSuite.scala`.
    
    I don't think we can or should proceed without addressing the issue of 
reading from two parquet files whose columns have identical names and types 
but appear in a different order in each file's schema. Personally, I think the 
fact that the Spark parquet reader appears to assume the same column order 
across otherwise compatible file schemas is a bug. I think column selection 
should be by name, not by index. The parquet-mr reader behaves that way.
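
    To make the distinction concrete, here is a minimal toy sketch. It is not 
Spark or parquet-mr code: the two "files", their schemas, and the helper names 
`read_by_index` and `read_by_name` are hypothetical, used only to illustrate 
how index-based selection misreads a file whose columns are ordered differently.

```python
# Two hypothetical "files": same column names and types, different column order.
file_a = {"schema": ["id", "name"], "row": [1, "alice"]}
file_b = {"schema": ["name", "id"], "row": ["bob", 2]}

# The columns the query asks for, in the order the reader expects them.
requested = ["id", "name"]


def read_by_index(file, requested):
    """Assume the file's column order matches the requested order."""
    return {col: file["row"][i] for i, col in enumerate(requested)}


def read_by_name(file, requested):
    """Resolve each requested column against this file's own schema."""
    position = {col: i for i, col in enumerate(file["schema"])}
    return {col: file["row"][position[col]] for col in requested}


if __name__ == "__main__":
    for label, f in (("file_a", file_a), ("file_b", file_b)):
        print(label, "by index:", read_by_index(f, requested))
        print(label, "by name: ", read_by_name(f, requested))
```

    In this sketch both strategies agree on `file_a`, but only the by-name 
lookup returns the right values for `file_b`.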
    
    As a stop-gap alternative, I suppose we could disable the built-in reader 
if parquet schema pruning is turned on. But I think that would be a rather 
ugly, invasive and confusing hack.
    
    Of course I'm open to other ideas as well.



