Github user andreweduffy commented on the issue:
https://github.com/apache/spark/pull/14649
Hyun mostly sums it up. This change uses Parquet's summary metadata when it
is available: rather than performing row-group-level filtering, it filters out
entire files. It does this while constructing the FileScanRDD, which means
tasks are only spawned for files that could match the predicate. At work we
were running into issues where very large S3 datasets took exceedingly long to
load in Spark. Empirically, we're running this exact patch in production, and
for many types of queries we see a very large decrease in both the number of
tasks created and the time spent fetching from S3. So this is mainly for the
use case of short-lived RDDs (where doing .persist doesn't help you) that are
backed by data in S3 (where eliminating read time is a significant speedup).
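To illustrate the idea, here is a minimal sketch of file-level pruning against
per-file min/max statistics. This is not the patch itself or Spark's actual
API; the names (`FileStats`, `GreaterThan`, `pruneFiles`) are hypothetical and
only show why skipping whole files avoids spawning tasks for them.

```scala
// Hypothetical per-file statistics, as would be read from Parquet
// summary metadata (min/max of one column per file).
case class FileStats(path: String, min: Long, max: Long)

// A toy predicate: keep rows whose column value is greater than a threshold.
case class GreaterThan(threshold: Long)

// Drop entire files whose [min, max] range cannot possibly satisfy the
// predicate, so no scan task is ever created for them.
def pruneFiles(files: Seq[FileStats], pred: GreaterThan): Seq[FileStats] =
  files.filter(f => f.max > pred.threshold)

object PruneDemo extends App {
  val files = Seq(
    FileStats("part-0000.parquet", min = 0L, max = 10L),
    FileStats("part-0001.parquet", min = 50L, max = 100L)
  )
  // Only part-0001 can contain rows with a value greater than 20,
  // so only one file (and thus one task) survives pruning.
  val kept = pruneFiles(files, GreaterThan(20L))
  println(kept.map(_.path).mkString(","))
}
```

The same reasoning applies per row group, but doing it per file before the
RDD is built is what cuts the task count.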