Github user liancheng commented on the issue:
https://github.com/apache/spark/pull/14649
Sorry for the late reply.
Firstly, Spark SQL only reads the footers of all Parquet files when schema
merging is enabled, which is controlled by the SQL option
`spark.sql.parquet.mergeSchema`. This is necessary because the schema of every
individual physical Parquet file has to be read to determine the global schema.
When schema merging is disabled, which is the default, summary files
(`_metadata` and/or `_common_metadata`) are still used if there are any. If no
summary files are available, Spark SQL just reads the footer of a single,
arbitrarily chosen Parquet file to get the schema. So it seems that the first
point mentioned in your PR description is not really a problem?
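For reference, a minimal sketch of how schema merging is toggled (the path here
is made up for illustration):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("merge-schema-demo").getOrCreate()

// Default: schema merging disabled. Spark SQL picks up summary files if
// present, otherwise it reads the footer of a single Parquet file.
val dfNoMerge = spark.read.parquet("/tmp/events")

// Opt in to schema merging per read ...
val dfMergedPerRead = spark.read.option("mergeSchema", "true").parquet("/tmp/events")

// ... or globally via the SQL option mentioned above. Only in this case are
// the footers of all Parquet files read to compute the merged (global) schema.
spark.conf.set("spark.sql.parquet.mergeSchema", "true")
val dfMergedGlobal = spark.read.parquet("/tmp/events")
```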
Secondly, although you mention "partition pruning", what the code change in
this PR actually performs is Parquet row group filtering, which is already a
feature of Spark SQL.
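To be concrete, row group filtering already happens for pushed-down data source
filters; a rough sketch, assuming a hypothetical `/tmp/events` dataset with an
`id` column:

```scala
import org.apache.spark.sql.functions.col

// Parquet filter pushdown is enabled by default and controlled by this option.
spark.conf.set("spark.sql.parquet.filterPushdown", "true")

// A simple comparison like this is translated into a data source filter and
// pushed down to the Parquet reader, which uses per-row-group min/max
// statistics to skip row groups that cannot contain matching rows.
val filtered = spark.read.parquet("/tmp/events").filter(col("id") > 100L)
filtered.explain()  // the physical plan lists the pushed-down Parquet filters
```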
Thirdly, partition pruning is already implemented in Spark SQL. Furthermore,
since partition pruning is handled inside the Spark SQL framework, not only
data source filters but also arbitrary Catalyst expressions can be used to
prune partitions.
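For example, with data laid out in partition directories, any deterministic
filter that references only the partition column prunes directories before a
single Parquet file is opened; a sketch with hypothetical paths and columns:

```scala
import org.apache.spark.sql.functions.{col, year}

// Assume a dataset laid out in partition directories such as
//   /tmp/events_by_date/date=2016-08-15/part-*.parquet
val events = spark.read.parquet("/tmp/events_by_date")

// Both filters below reference only the partition column, so they are
// evaluated against the partition values parsed from the directory names;
// files in non-matching partitions are never read.
val oneDay  = events.filter(col("date") === "2016-08-15")   // data source filter
val oneYear = events.filter(year(col("date")) === 2016)     // arbitrary Catalyst expression
```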
That said, I don't see the benefit of this PR. Did I miss something here?