GitHub user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/14671
Thanks for cc'ing me! As you might already know, I think it makes sense to
allow filtering row groups, but this would also re-enable row-by-row filtering
in the normal Parquet reader, which was removed by
[SPARK-16400](https://issues.apache.org/jira/browse/SPARK-16400). So, let me
cc @rxin and @liancheng here.
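For context (and please correct me if I'm wrong), the only knob I know of
today is the global pushdown flag, so row-group pruning and Parquet's
record-level filtering cannot be toggled independently; assuming a
spark-shell session:

```scala
// Global Parquet filter pushdown switch; as far as I know this covers both
// row-group pruning and (before SPARK-16400) Parquet's record-by-record
// filtering, with no separate toggle for each.
spark.conf.set("spark.sql.parquet.filterPushdown", "true")
```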
IMHO, I remember there was a concern (sorry, I can't find the reference) that
Spark-side codegen row-by-row filtering might generally be faster than
Parquet's, because Parquet's record-level filtering incurs type boxing and
virtual function calls that Spark's codegen avoids.
So, actually, I was thinking of bringing this back after (maybe) Parquet
row-by-row filtering is disabled in Spark, so that row groups can be filtered properly.
I am pretty sure filtering row groups makes sense, but I am a bit
hesitant about the row-by-row one because it seems it was removed for better
performance, and bringing it back might be a performance regression even though
the implementation is different. Do we maybe need a benchmark?
Otherwise, maybe we should run an experiment to check whether Spark's codegen
filtering is actually faster than Parquet's, so that we can decide to disable
row-by-row filtering first (although I am not sure if this was already done
somewhere or offline).
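Just to sketch what I mean by a benchmark (the path, sizes, and predicate
below are made up, and this only flips the existing global
`spark.sql.parquet.filterPushdown` flag, not row-group vs record-level
filtering separately):

```scala
import org.apache.spark.sql.SparkSession

object ParquetFilterBench {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("parquet-filter-bench")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Placeholder data set: 10M rows written in increasing `id` order, so
    // row-group min/max stats make a range predicate highly selective.
    val path = "/tmp/parquet_filter_bench"
    spark.range(0L, 10000000L).write.mode("overwrite").parquet(path)

    // Time a full scan with the given pushdown setting; count() forces
    // the scan to actually run.
    def timeScan(pushdown: Boolean): Long = {
      spark.conf.set("spark.sql.parquet.filterPushdown", pushdown.toString)
      val start = System.nanoTime()
      spark.read.parquet(path).filter($"id" < 1000L).count()
      (System.nanoTime() - start) / 1000000L
    }

    // Warm up once, then measure each mode a few times.
    timeScan(pushdown = true)
    for (_ <- 1 to 3) {
      println(s"pushdown=on:  ${timeScan(true)} ms")
      println(s"pushdown=off: ${timeScan(false)} ms")
    }

    spark.stop()
  }
}
```

(We would of course want several runs per configuration, since the OS page
cache will favor whichever scan happens to run later.)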