Github user pwoody commented on the issue:
https://github.com/apache/spark/pull/15835
Hey @HyukjinKwon - appreciate the feedback!
Re: file touching - If I add the cache to the `_metadata` file, then this
PR will end up touching at most one file per rootPath driver-side (generally
just one in total).
Re: files vs. splits - The main difference when pruning splits instead of
files is that with larger files you end up spawning tasks that are
immediately filtered out executor-side after grabbing the footer. For
simplicity, if maxSplitBytes == the Parquet row group size, then a single
hit in a file will spawn a task for every row group even if the file
only has one matching block. This overhead can get expensive in a
setup with dynamicAllocation and multi-tenancy. My general goal is to reduce
the total number of tasks.
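To make the task-count overhead concrete, here is a small illustrative sketch (hypothetical file counts and row-group counts, not taken from the PR) comparing the number of tasks spawned when non-matching work is only discovered executor-side versus when whole non-matching files are pruned driver-side:

```python
# Hypothetical arithmetic: task counts with and without driver-side file
# pruning, assuming maxSplitBytes equals the Parquet row group size so
# each split corresponds to exactly one row group.

def tasks_without_file_pruning(row_groups_per_file):
    # Every split becomes a task; tasks for non-matching row groups are
    # only filtered out executor-side after reading the footer.
    return sum(row_groups_per_file)

def tasks_with_file_pruning(row_groups_per_file, hits_per_file):
    # Files whose footer stats show no matching row group are dropped
    # driver-side; files with at least one hit still spawn a task per
    # split (i.e. per row group).
    return sum(
        groups
        for groups, hits in zip(row_groups_per_file, hits_per_file)
        if hits > 0
    )

# Example: 10 files of 8 row groups each, and only one file contains a
# single matching row group.
row_groups = [8] * 10
hits = [1] + [0] * 9

print(tasks_without_file_pruning(row_groups))      # 80 tasks
print(tasks_with_file_pruning(row_groups, hits))   # 8 tasks
```

Even in this small example, driver-side file pruning cuts the scheduled task count by 10x; under dynamicAllocation and multi-tenancy, the per-task scheduling cost is what makes this worth avoiding.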
Re: bucketed - yep sorry, will fix!
Re: benchmarks - yeah, totally happy to poke around and put together some
benchmarks. I suppose ParquetReadBenchmark is the appropriate place?
Re: old PR - the code there is out of date, and I don't believe Andrew is
actively working on it based on his last comment. This is a follow-up to the
original PR in what I believe is a more comprehensive and reliable way.
Thanks!