Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/15835
  
    Hi @pwoody. If I understood this correctly, the original PR only filters 
out files ahead of time, whereas this one also proposes to filter splits on the 
driver side using offsets from Parquet's metadata, right?
    
    IIRC, each task already reads the footer in the executor before it starts 
reading data, and then drops the splits on the Parquet side. This works fine in 
both Spark's vectorized Parquet reader and the normal Parquet reader. It might 
be worth reducing the number of files to touch, for the reasons I and others 
described in the original PR, but I am not sure about pruning splits ahead of 
time.
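
    To make the split-dropping behaviour concrete, here is a minimal sketch of 
how row groups could be matched to a split from footer offsets. It assumes the 
midpoint rule (a row group belongs to the split that contains the midpoint of 
its byte range), which is how I understand the Parquet side to behave. The names 
`RowGroup` and `row_groups_for_split` are hypothetical and are not Spark or 
Parquet APIs.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RowGroup:
    """Hypothetical stand-in for the row group info found in a Parquet footer."""
    start: int  # byte offset of the row group within the file
    size: int   # compressed size of the row group in bytes


def row_groups_for_split(row_groups: List[RowGroup],
                         split_start: int,
                         split_length: int) -> List[RowGroup]:
    """Keep the row groups whose midpoint falls inside the split's byte range.

    Assumption: midpoint-based assignment, so each row group is read by
    exactly one split even when it straddles a split boundary.
    """
    split_end = split_start + split_length
    kept = []
    for rg in row_groups:
        midpoint = rg.start + rg.size // 2
        if split_start <= midpoint < split_end:
            kept.append(rg)
    return kept


# Three 100-byte row groups in a file that is cut into 128-byte splits.
groups = [RowGroup(4, 100), RowGroup(104, 100), RowGroup(204, 100)]
first = row_groups_for_split(groups, 0, 128)
second = row_groups_for_split(groups, 128, 128)
third = row_groups_for_split(groups, 256, 128)
```

    In this made-up layout the third split matches no row group, so the task 
that receives it has nothing to read after parsing the footer. Pruning such 
splits on the driver would only save that footer read and the task launch, 
which is why the gain is unclear to me.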
    
    Another potential problem I see here is that it does not seem to consider 
bucketed table reads, whereas the original PR does.
    
    In addition, I guess we really need a benchmark, since the proposal is 
meant to improve performance. It is fine if I am wrong, provided this PR has a 
benchmark showing the improvement with a reasonable explanation.
    
    Lastly, I guess this is a follow-up that includes the changes proposed in 
14649. Maybe we can wait until that is merged before submitting a follow-up. I 
guess it is still being reviewed, and I think @andreweduffy is still responding 
there.

