Github user pwoody commented on the issue:

    https://github.com/apache/spark/pull/15835
  
    Hey @HyukjinKwon - appreciate the feedback!
    
    Re: file touching - If I add the cache to the `_metadata` file, then this 
PR will end up touching at most one file per rootPath on the driver side 
(generally just one in total).
    
    Re: files vs. splits - The main difference when pruning splits instead of 
files is that with larger files you end up spawning tasks that are immediately 
filtered out executor-side after reading the footer. For simplicity, if 
maxSplitBytes equals the Parquet row group size, then a single hit in a file 
will spawn a task for every row group, even if only one block in the file 
matches. This overhead can get expensive in a setup with dynamicAllocation and 
multi-tenancy. My general goal is to reduce the total number of tasks.
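    To make the overhead concrete, here is a back-of-the-envelope sketch 
with hypothetical numbers (a 1 GiB file with 128 MiB row groups; the constants 
and variable names are illustrative, not Spark's actual scheduler logic):

    ```python
    # Hypothetical sizes chosen so maxSplitBytes == Parquet row group size.
    ROW_GROUP_BYTES = 128 * 1024 * 1024   # Parquet row group size
    MAX_SPLIT_BYTES = ROW_GROUP_BYTES     # maxSplitBytes == row group size
    FILE_BYTES = 1024 * 1024 * 1024       # one 1 GiB file -> 8 row groups

    # Split-level scheduling: a single matching row group still produces one
    # task per split; the non-matching tasks only find out executor-side
    # after grabbing the footer.
    tasks_split_level = FILE_BYTES // MAX_SPLIT_BYTES   # 8 tasks, 7 wasted

    # Footer-based pruning driver-side: only the matching block needs a task.
    matching_row_groups = 1
    tasks_file_level = matching_row_groups              # 1 task

    print(tasks_split_level, tasks_file_level)
    ```

    So in this toy setup, split-level scheduling launches 8 tasks where 1 
would do, and the waste scales with file size.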
    
    Re: bucketed - yep sorry, will fix!
    
    Re: benchmarks - yeah, totally happy to poke around and put together some 
benchmarks. `ParquetReadBenchmark` is the appropriate place, I suppose?
    
    Re: old PR - the code there is out of date, and based on his last comment 
I don't believe Andrew is actively working on it. This is the follow-up to the 
original effort, done in what I believe is a more comprehensive and reliable 
way.
    
    Thanks!
