Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/14671
  
    Thanks for cc'ing me! As you might already know, I think it makes sense to allow filtering row groups, but the pushed-down predicate would also be applied row-by-row by the normal Parquet reader, and that row-by-row filtering was removed by [SPARK-16400](https://issues.apache.org/jira/browse/SPARK-16400). So, let me cc @rxin and @liancheng here.
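    For context, this is roughly how a pushed-down predicate reaches parquet-mr (just a sketch; the column name `id` and the threshold are made up): a single `FilterPredicate` drives both row-group pruning via min/max statistics and, by default, record-by-record filtering inside the surviving row groups, which is why enabling pushdown currently turns on both.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.parquet.filter2.predicate.FilterApi
import org.apache.parquet.hadoop.ParquetInputFormat

// One predicate, two effects in parquet-mr: row groups whose min/max
// statistics cannot satisfy it are skipped entirely, and the records in
// the remaining row groups are then checked against it one by one.
val conf = new Configuration()
val pred = FilterApi.gt(FilterApi.longColumn("id"), java.lang.Long.valueOf(100L))
ParquetInputFormat.setFilterPredicate(conf, pred)
```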
    
    IMHO, I remember there being a concern (sorry, I can't find the reference) that Spark-side codegen row-by-row filtering might generally be faster than Parquet's, because Parquet's record-level filtering pays for type boxing and virtual function calls that Spark's generated code avoids.
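    To illustrate that concern (a toy sketch, not actual Spark or Parquet code): a generic filter interface boxes every primitive value and dispatches a virtual call per record, whereas codegen effectively emits a monomorphic comparison on unboxed primitives that the JIT can inline into the scan loop.

```scala
object FilterStyles {
  // Generic, interface-based filtering: each primitive is boxed to an
  // object and every record pays a virtual call through the trait.
  trait GenericFilter { def keep(value: Any): Boolean }

  val genericGt100: GenericFilter = new GenericFilter {
    def keep(value: Any): Boolean = value.asInstanceOf[Long] > 100L
  }

  // What whole-stage codegen effectively produces instead: a monomorphic
  // comparison on an unboxed primitive, inlinable into the scan loop.
  def specializedGt100(value: Long): Boolean = value > 100L
}
```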
    
    So, actually, I was thinking of bringing this back after (maybe) Parquet row-by-row filtering is disabled in Spark, so that row groups can still be filtered properly.
    
    I am pretty sure filtering row groups makes sense, but I am a bit hesitant about the row-by-row one because it seems it was removed for better performance, and bringing it back might be a performance regression even though the implementation is different. Do we maybe need a benchmark?
    
    Otherwise, maybe we should run an experiment to check whether the Spark codegen one is actually faster than Parquet's, so that we can decide to disable row-by-row filtering first (although I am not sure if this was already done somewhere or offline).
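    Something like the following, pasted into `spark-shell`, could give a first-order signal (just a sketch; the path and predicate are made up, and toggling `spark.sql.parquet.filterPushdown` does not isolate row-group skipping from record-level filtering, so the numbers would need more careful slicing):

```scala
import org.apache.spark.sql.SparkSession

// Rough timing comparison: pushdown on lets Parquet do the filtering,
// pushdown off makes Spark's codegen filter after a full scan.
val spark = SparkSession.builder().master("local[*]").appName("pushdown-bench").getOrCreate()

def time(label: String)(body: => Unit): Unit = {
  val start = System.nanoTime()
  body
  println(f"$label: ${(System.nanoTime() - start) / 1e9}%.2f s")
}

for (pushdown <- Seq(true, false)) {
  spark.conf.set("spark.sql.parquet.filterPushdown", pushdown.toString)
  time(s"filterPushdown=$pushdown") {
    spark.read.parquet("/tmp/bench.parquet").filter("id > 100").count()
  }
}
```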

