Github user liancheng commented on the pull request:

    https://github.com/apache/spark/pull/8956#issuecomment-151039952
  
    Well, it depends. The situation is a little bit tricky to explain. In 
general there are two cases:
    
    1.  String filters with high selectivity, namely most records can be dropped
    
        - Performance
    
          Usually, I'd expect there's no noticeable performance gain, because 
each record is checked against the filter pushed down, and string operations 
themselves are CPU bound. So the performance should be similar to the case no 
filter is pushed down at all.
    
          However, a properly implemented `StringStartsWithFilter.canDrop` (as 
I mentioned in [this comment][1]) can bring big performance win since it can 
drop entire row groups whenever possible. But this requires us to bump 
parquet-mr to 1.8+ first, which is done in #9225.
    
        - Memory footprint
    
          What @viirya observed is reasonable.  One benefit of Parquet filters 
is that, they are performed before record assembling, namely we can drop a 
record before converting the underlying Parquet column values into an 
`InternalRow`.  I think that's the reason why @viirya observed that OOM was 
gone.
    
          (BTW, `ParquetRelation` processes all the data using iterators, so we 
don't read all the data first and then apply the filters. My theory is that 
it's the `InternalRow` materialization costs more memory.)
    
    1.  String filters with low selectivity, namely most records can NOT be 
dropped
    
        - Performance
    
          In this case, I'd expect performance regression. This is because 
currently Spark SQL tends to be pessimistic, and always applies all the filters 
again even if some of them are pushed down. In this case, almost all records 
are filtered twice. Since string operations are CPU bound, this can be time 
consuming.
    
        - Memory footprint
    
          Since the string filters in this PR "steal" the underlying byte 
arrays without copying them, I'd expect the memory footprint is similar to the 
normal case.
    
    [1]: https://github.com/apache/spark/pull/8956#discussion_r42176719



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to