Github user andreweduffy commented on the issue:
https://github.com/apache/spark/pull/14671
Thanks for the comments guys! Had to search through some code, but I think
I understand the current state of things. Correct me if I'm wrong, but it seems
that record-by-record filtering only occurs when the vectorized reader is
disabled: there is no logic in SpecificParquetRecordReaderBase to perform
individual record filtering, so currently it only happens when we fall back
to the Parquet-provided ParquetRecordReader.
After doing some digging into ParquetRecordReader, I found that it pushes the
"record-by-record" filter down to InternalParquetRecordReader. However, it
appears that inside InternalParquetRecordReader you can actually disable
row-by-row filtering with the magic conf `parquet.filter.record-level.enabled`
([declaration](https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetInputFormat.java#L121)
and
[usage](https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordReader.java#L180)).
So, in theory, if we set this configuration to false when we construct the
ParquetRecordReader in
[ParquetFileFormat](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L381),
we wouldn't have to worry about row-by-row filtering being applied along
either code path (I think).
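
For concreteness, here's a rough sketch of the idea (just flipping the conf on
the Hadoop `Configuration` before it reaches the reader; all of the surrounding
ParquetFileFormat plumbing is elided):

```scala
import org.apache.hadoop.conf.Configuration

// Sketch only: disable Parquet's record-level filtering on the Hadoop conf
// that eventually reaches ParquetRecordReader / InternalParquetRecordReader.
// Row-group pruning (statistics / dictionary filters) should be unaffected;
// only the per-record filter pass is skipped, since Spark re-evaluates the
// pushed-down predicates on its side anyway.
val hadoopConf = new Configuration()
hadoopConf.setBoolean("parquet.filter.record-level.enabled", false)
```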