Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/14671
Yea, that is all true. Actually, it would be okay just not to pass the filter
[here](https://github.com/apache/spark/blob/abff92bfdc7d4c9d2308794f0350561fe0ceb4dd/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L383)
for the normal Parquet reader, because we already set the filter for row groups
[here](https://github.com/apache/spark/blob/abff92bfdc7d4c9d2308794f0350561fe0ceb4dd/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L365)
for both the normal and the vectorized readers.
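For reference, a rough sketch of the two call sites linked above, paraphrased from `ParquetFileFormat.buildReader` around that commit as it would appear inside the file itself (this is my approximation, not the code verbatim; `hadoopConf` and `pushedPredicate: Option[FilterPredicate]` are placeholder names, and `ParquetReadSupport` is in the same package there):

```scala
import org.apache.parquet.filter2.compat.FilterCompat
import org.apache.parquet.hadoop.{ParquetInputFormat, ParquetRecordReader}
import org.apache.spark.sql.catalyst.InternalRow

// Row-group level push-down (the second link above): the converted predicate is
// set on the Hadoop configuration, so both the vectorized and the normal reader
// get row-group skipping.
pushedPredicate.foreach { predicate =>
  ParquetInputFormat.setFilterPredicate(hadoopConf, predicate)
}

// Record-level filtering (the first link above): only the normal (non-vectorized)
// path also hands the predicate to the reader itself, which turns on Parquet's
// row-by-row filtering (filter2) on top of the row-group skipping above.
val reader = pushedPredicate match {
  case Some(predicate) =>
    new ParquetRecordReader[InternalRow](
      new ParquetReadSupport, FilterCompat.get(predicate, null))
  case _ =>
    new ParquetRecordReader[InternalRow](new ParquetReadSupport)
}
```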
(BTW, you might have already noticed that the actual per-record filtering (filter2)
happens in `FilteringRecordMaterializer`, which uses Spark's
`ParquetRecordMaterializer` as a delegate.)
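Just to illustrate that delegation: below is a simplified, hypothetical model, not parquet-mr's actual `FilteringRecordMaterializer` API (the real class takes more constructor arguments). The idea is only that the filtering materializer evaluates the pushed predicate per record and drops records that fail, delegating the actual assembly to Spark's materializer.

```scala
// Hypothetical shapes for illustration only; parquet-mr's real classes differ.
trait RecordMaterializerLike[T] {
  def currentRecord: T  // the record assembled from the current Parquet row
}

class FilteringMaterializerSketch[T >: Null](
    delegate: RecordMaterializerLike[T],   // e.g. Spark's ParquetRecordMaterializer
    recordPasses: () => Boolean)           // evaluates the filter2 predicate for the row
  extends RecordMaterializerLike[T] {

  // Records that fail the predicate are dropped (here: surfaced as null so the
  // caller can skip them); everything else is assembled by the delegate.
  override def currentRecord: T =
    if (recordPasses()) delegate.currentRecord else null
}
```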
However, if my understanding is correct, the point is that it _seems_ not even
clear whether we should disable Parquet's row-by-row filtering or not. In my
point of view, we need a benchmark to see whether Spark's codegen filter is
really faster than Parquet's row-by-row one.
I _guess_ we are assuming Spark's is faster than Parquet's.
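If it helps, a minimal sketch of what such a benchmark could look like; note that `spark.sql.parquet.filterPushdown` toggles both row-group and record-level push-down, so fully isolating the row-by-row part would still need a code change like the one discussed here. The data path and filter column are placeholders:

```scala
import org.apache.spark.sql.SparkSession

object ParquetFilterBenchmark {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("parquet-row-filter-benchmark")
      .master("local[*]")
      .getOrCreate()

    // To exercise the normal (non-vectorized) path, the vectorized reader can be
    // turned off as well.
    spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")

    def timeScan(pushdownEnabled: Boolean): Long = {
      spark.conf.set("spark.sql.parquet.filterPushdown", pushdownEnabled.toString)
      val start = System.nanoTime()
      // Placeholder path and predicate; a selective filter makes the difference visible.
      spark.read.parquet("/tmp/benchmark-data").filter("id % 1000 = 0").count()
      (System.nanoTime() - start) / 1000000
    }

    // Warm up once, then compare. With push-down off, filtering is done only by
    // Spark's codegen'd filter; with it on, Parquet also filters (row groups and,
    // for the normal reader, individual records).
    timeScan(pushdownEnabled = true)
    println(s"push-down enabled:  ${timeScan(pushdownEnabled = true)} ms")
    println(s"push-down disabled: ${timeScan(pushdownEnabled = false)} ms")

    spark.stop()
  }
}
```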