GitHub user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/15835
Hi @pwoody. If I understood this correctly, the original PR only filters
out the files to touch ahead of time, whereas this one also proposes
filtering splits on the driver side via offsets from Parquet's metadata, right?
IIRC, each task already reads the footer in the executor before it actually
starts to read the data, and then drops the splits on the Parquet side. This
works fine in both Spark's vectorized Parquet reader and the normal Parquet
reader. It might be worth reducing the files to touch, for the reasons I and
others described in the original PR, but I am not sure about pruning splits
ahead of time.
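
To illustrate what I mean by tasks dropping splits on the Parquet side, here
is a rough sketch in Python. It is only a toy model, not Spark or Parquet
code: the `RowGroup` and `Split` classes and all function names are
hypothetical, and the midpoint rule for assigning row groups to splits is my
assumption about how the Parquet reader behaves.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RowGroup:
    start: int      # byte offset of the row group in the file (from the footer)
    size: int       # compressed size in bytes
    min_value: int  # column statistics stored in the footer
    max_value: int


@dataclass
class Split:
    start: int      # byte offset where the split begins
    length: int     # number of bytes covered by the split


def row_groups_for_split(footer: List[RowGroup], split: Split) -> List[RowGroup]:
    # Assumption: a row group belongs to the split that contains its midpoint.
    end = split.start + split.length
    return [rg for rg in footer if split.start <= rg.start + rg.size // 2 < end]


def prune_by_stats(row_groups: List[RowGroup], lower: int, upper: int) -> List[RowGroup]:
    # Keep only row groups whose [min, max] range overlaps the filter range.
    return [rg for rg in row_groups if rg.max_value >= lower and rg.min_value <= upper]


def row_groups_read_by_task(footer: List[RowGroup], split: Split,
                            lower: int, upper: int) -> List[RowGroup]:
    # What a single task does after reading the footer in the executor.
    return prune_by_stats(row_groups_for_split(footer, split), lower, upper)


# Hypothetical file with three row groups, read as three 128-byte splits.
footer = [
    RowGroup(start=0, size=100, min_value=0, max_value=9),
    RowGroup(start=100, size=100, min_value=10, max_value=19),
    RowGroup(start=200, size=100, min_value=20, max_value=29),
]
splits = [Split(0, 128), Split(128, 128), Split(256, 128)]

# Filter: 10 <= value <= 19. A task with an empty result reads no data.
surviving = [row_groups_read_by_task(footer, s, 10, 19) for s in splits]
```

In this model a task whose split ends up with no row groups does no data
reads, but it is still scheduled and still has to read the footer. Whether
saving that remaining cost is worth the extra driver-side footer reads is
what I am unsure about.
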
Another potential problem I see here is that it does not seem to consider
bucketed table reads, whereas the original PR does.
In addition, I think we really need a benchmark for a proposal that aims to
improve performance. It is fine if I am wrong and this PR has a benchmark
showing the performance improvement with a reasonable explanation.
Lastly, I guess this is a followup that includes the changes proposed in 14649.
Maybe we can wait until that is merged before submitting a followup. I guess
it is being reviewed, and I think @andreweduffy is still responding.