Github user liancheng commented on the issue:
https://github.com/apache/spark/pull/14649
Sorry for the late reply.
Firstly, Spark SQL only reads the footers of all Parquet files when schema
merging is enabled, which is controlled by the SQL option
`spark.sql.parquet.mergeSchema`. This is necessary because the schema of every
individual physical Parquet file has to be read to determine the global schema.
When schema merging is disabled, which is the default, summary files
(`_metadata` and/or `_common_metadata`) are still used if there are any. If no
summary files are available, Spark SQL just reads the footer of a single,
arbitrarily chosen Parquet file to get the schema. So it seems that the first
point mentioned in your PR description is not really a problem?
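For reference, a minimal sketch of how schema merging is toggled (the path here
is made up for illustration):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("merge-schema-demo").getOrCreate()

// Default: schema merging disabled. Spark SQL picks up summary files if
// present, otherwise it reads the footer of a single Parquet file.
val dfNoMerge = spark.read.parquet("/tmp/events")

// Opt in to schema merging per read ...
val dfMergedPerRead = spark.read.option("mergeSchema", "true").parquet("/tmp/events")

// ... or globally via the SQL option mentioned above. Only in this case are
// the footers of all Parquet files read to compute the merged (global) schema.
spark.conf.set("spark.sql.parquet.mergeSchema", "true")
val dfMergedGlobal = spark.read.parquet("/tmp/events")
```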
Secondly, although you mention "partition pruning", what the code change in
this PR actually performs is Parquet row group filtering, which is already a
feature of Spark SQL.
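To be concrete, row group filtering already happens for pushed-down data source
filters; a rough sketch, assuming a hypothetical `/tmp/events` dataset with an
`id` column:

```scala
import org.apache.spark.sql.functions.col

// Parquet filter pushdown is enabled by default and controlled by this option.
spark.conf.set("spark.sql.parquet.filterPushdown", "true")

// A simple comparison like this is translated into a data source filter and
// pushed down to the Parquet reader, which uses per-row-group min/max
// statistics to skip row groups that cannot contain matching rows.
val filtered = spark.read.parquet("/tmp/events").filter(col("id") > 100L)
filtered.explain()  // the physical plan lists the pushed-down Parquet filters
```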
Thirdly, partition pruning is already implemented in Spark SQL. Furthermore,
since partition pruning is handled inside the Spark SQL framework, not only
data source filters but also arbitrary Catalyst expressions can be used to
prune partitions.
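For example, with data laid out in partition directories, any deterministic
filter that references only the partition column prunes directories before a
single Parquet file is opened; a sketch with hypothetical paths and columns:

```scala
import org.apache.spark.sql.functions.{col, year}

// Assume a dataset laid out in partition directories such as
//   /tmp/events_by_date/date=2016-08-15/part-*.parquet
val events = spark.read.parquet("/tmp/events_by_date")

// Both filters below reference only the partition column, so they are
// evaluated against the partition values parsed from the directory names;
// files in non-matching partitions are never read.
val oneDay  = events.filter(col("date") === "2016-08-15")   // data source filter
val oneYear = events.filter(year(col("date")) === 2016)     // arbitrary Catalyst expression
```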
That said, I don't see the benefit of this PR. Did I miss something here?