Github user liancheng commented on the pull request:
https://github.com/apache/spark/pull/8956#issuecomment-151039952
Well, it depends. The situation is a little bit tricky to explain. In
general there are two cases:
1. String filters with high selectivity, namely most records can be dropped
- Performance
Usually, I'd expect there's no noticeable performance gain, because
each record is checked against the filter pushed down, and string operations
themselves are CPU bound. So the performance should be similar to the case no
filter is pushed down at all.
However, a properly implemented `StringStartsWithFilter.canDrop` (as
I mentioned in [this comment][1]) can bring big performance win since it can
drop entire row groups whenever possible. But this requires us to bump
parquet-mr to 1.8+ first, which is done in #9225.
- Memory footprint
What @viirya observed is reasonable. One benefit of Parquet filters
is that, they are performed before record assembling, namely we can drop a
record before converting the underlying Parquet column values into an
`InternalRow`. I think that's the reason why @viirya observed that OOM was
gone.
(BTW, `ParquetRelation` processes all the data using iterators, so we
don't read all the data first and then apply the filters. My theory is that
it's the `InternalRow` materialization costs more memory.)
1. String filters with low selectivity, namely most records can NOT be
dropped
- Performance
In this case, I'd expect performance regression. This is because
currently Spark SQL tends to be pessimistic, and always applies all the filters
again even if some of them are pushed down. In this case, almost all records
are filtered twice. Since string operations are CPU bound, this can be time
consuming.
- Memory footprint
Since the string filters in this PR "steal" the underlying byte
arrays without copying them, I'd expect the memory footprint is similar to the
normal case.
[1]: https://github.com/apache/spark/pull/8956#discussion_r42176719
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]