Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/13701
Yeah, Parquet doesn't make a distinction about where filters are applied. If
you push a filter, it will be applied to row groups where possible and to
individual rows after that. But if you're bypassing Parquet's record assembly
to get vectorized reads, then you don't have to worry about the latter. I'm
not sure whether Spark does that, since I haven't had a chance to get up to
speed with Spark's non-Hive read path for Parquet.
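To make the two levels concrete, here is a minimal sketch in plain Python (not Parquet's actual API; `read_with_filter` and the list-of-lists row-group model are illustrative only) of how a pushed-down range predicate is first checked against per-group min/max statistics, and only then applied row by row in the groups that could not be skipped:

```python
# Illustrative model: each "row group" is a list of values, and a pushed
# filter is a range predicate lo <= v <= hi. Parquet applies the filter at
# two levels: whole row groups are skipped using min/max stats, and rows
# in surviving groups are filtered individually.

def read_with_filter(row_groups, lo, hi):
    """Return values matching lo <= v <= hi, skipping prunable row groups."""
    out = []
    for group in row_groups:
        gmin, gmax = min(group), max(group)
        # Row-group filtering: the group's stats rule it out entirely.
        if gmax < lo or gmin > hi:
            continue
        # Row-level filtering: surviving groups are scanned row by row.
        out.extend(v for v in group if lo <= v <= hi)
    return out

groups = [[1, 2, 3], [10, 11, 12], [20, 21]]
print(read_with_filter(groups, 10, 12))  # the middle group survives intact
```

A vectorized reader that bypasses record assembly gets the first level (row-group skipping) but may leave the second (row-level filtering) to the engine.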
What you can do to work around this is to have an ID column that is written
in order. Then you can use the column stats to see where the divisions between
row groups end up and set your filters so that row-group filtering is the only
thing that will be applied.
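A sketch of that workaround, again in plain Python with hypothetical helpers rather than real Parquet metadata calls: because the ID column is written in order, each row group's [min, max] range is disjoint, so a filter aligned to one group's range prunes every other group purely from stats, with no row-level filtering needed.

```python
# Hypothetical model of the workaround: IDs are written in increasing
# order, so each row group's (min, max) stats form a disjoint range.

def row_group_ranges(row_groups):
    """Read the (min_id, max_id) stats of each row group.
    With ordered IDs these are just the first and last values."""
    return [(g[0], g[-1]) for g in row_groups]

def groups_selected(ranges, lo, hi):
    """Indices of row groups whose stats overlap the filter lo <= id <= hi."""
    return [i for i, (gmin, gmax) in enumerate(ranges)
            if not (gmax < lo or gmin > hi)]

# Three row groups of ordered IDs: [0..4], [5..9], [10..14].
groups = [list(range(0, 5)), list(range(5, 10)), list(range(10, 15))]
ranges = row_group_ranges(groups)

# A filter aligned to the second group's stats (id >= 5 and id <= 9)
# selects exactly that group; the others are skipped from stats alone.
print(groups_selected(ranges, 5, 9))
```

The key point is that aligning the filter bounds to the divisions the stats reveal makes row-group filtering exact, so every selected group is read whole.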