Github user rdblue commented on the issue:

    https://github.com/apache/spark/pull/13701
  
    Yeah, Parquet doesn't make a distinction for where filters are applied. If 
you push a filter, then it will be applied to row groups if possible and 
individual rows after that. But if you're bypassing the record assembly in 
Parquet to get vectorized reads, then you don't have to worry about the latter. 
I'm not sure if Spark does that, since I've not had a chance to get up to speed 
with Spark's non-Hive read path for Parquet.
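    The two levels described above can be sketched in plain Python (this is a 
conceptual model of Parquet's behavior, not its actual API): each row group 
carries min/max column statistics, a pushed filter first skips whole row groups 
whose stats cannot match, and only the surviving groups are filtered row by row.

    ```python
    def filter_rows(row_groups, lo, hi):
        """Apply a range predicate lo <= x <= hi at both levels.

        row_groups: list of lists of values, one inner list per row group.
        """
        out = []
        for rows in row_groups:
            mn, mx = min(rows), max(rows)   # stand-in for row-group stats
            if mx < lo or mn > hi:
                continue                    # row-group level: skip via stats
            # row level: evaluate the predicate on each surviving row
            out.extend(v for v in rows if lo <= v <= hi)
        return out
    ```

    A reader that bypasses record assembly (e.g. a vectorized reader) gets the 
first step but not necessarily the second, which is the distinction being made 
above.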
    
    What you can do to work around this is to have an ID column that is written 
in order. Then you can use the column stats to see where the divisions between 
row groups end up, and set your filter bounds to match those divisions so that 
row-group filtering is the only filtering that gets applied.
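    A minimal sketch of that workaround, again in plain Python rather than the 
Parquet API: with an ID column written in order, each row group's min/max stats 
form a contiguous range, and you can check whether a given filter range lands 
exactly on group boundaries (so no per-row filtering would be needed).

    ```python
    def group_bounds(row_groups):
        """Min/max stats per row group for an ID column written in order."""
        return [(rows[0], rows[-1]) for rows in row_groups]

    def selects_whole_groups(bounds, lo, hi):
        """True if the range [lo, hi] never partially overlaps a row group.

        A partial overlap is the case where row-level filtering would still
        have to run; aligned bounds mean row-group filtering alone suffices.
        """
        for mn, mx in bounds:
            overlaps = not (mx < lo or mn > hi)
            if overlaps and not (lo <= mn and mx <= hi):
                return False
        return True
    ```

    In practice you would read the real per-row-group stats from the file 
footer and pick `lo`/`hi` from an actual group's min and max.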

