Github user andreweduffy commented on the issue:

    https://github.com/apache/spark/pull/14671
  
    Thanks for the comments guys! Had to search through some code, but I think
I now understand the current state of things. Correct me if I'm wrong, but it
seems that record-by-record filtering only occurs when the vectorized reader is
disabled: there is no logic in SpecificParquetRecordReaderBase to perform
individual record filtering, so currently it only happens when we fall back to
the Parquet-provided ParquetRecordReader.
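
    As a quick way to exercise that path, here is a minimal sketch that forces
the fallback by turning off the vectorized reader from the Spark side (the
input path and predicate are just placeholders):

    ```scala
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("record-filter-check")
      .master("local[*]")
      .getOrCreate()

    // Disable the vectorized Parquet reader so the scan falls back to
    // parquet-mr's ParquetRecordReader, the path discussed above.
    spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")

    // With the fallback reader, pushed-down predicates are also evaluated
    // record by record inside InternalParquetRecordReader.
    val df = spark.read.parquet("/path/to/data.parquet").filter("id > 100")
    df.explain()
    ```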
    
    After doing some digging into ParquetRecordReader, I see that it delegates the
record-by-record filtering to InternalParquetRecordReader. However, it appears
that inside InternalParquetRecordReader you can actually disable row-by-row
filtering with the magic conf `parquet.filter.record-level.enabled`
([declaration](https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetInputFormat.java#L121)
and
[usage](https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordReader.java#L180)).
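
    Since that key is just a plain Hadoop conf entry, disabling it is a one-liner.
A minimal sketch (stats/dictionary-based row-group pruning appears to be
controlled by separate `parquet.filter.*` keys, so it should be unaffected):

    ```scala
    import org.apache.hadoop.conf.Configuration

    // Flip off parquet-mr's record-level filtering on the Hadoop conf that
    // InternalParquetRecordReader consults (per the usage link above).
    val hadoopConf = new Configuration()
    hadoopConf.setBoolean("parquet.filter.record-level.enabled", false)
    ```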
    
    So in theory, if we set this configuration to false when we construct the
ParquetRecordReader in
[ParquetFileFormat](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L381),
we wouldn't have to worry about row-by-row filtering being applied along either
code path (I think).
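
    A rough sketch of what that could look like at the point where the fallback
reader is built (variable names like `hadoopAttemptContext` and `pushed` follow
the linked ParquetFileFormat code, but this is illustrative rather than a tested
patch):

    ```scala
    // Inside ParquetFileFormat#buildReaderWithPartitionValues, before the
    // fallback ParquetRecordReader is constructed: turn off record-level
    // filtering so pushed filters are only used for row-group pruning.
    hadoopAttemptContext.getConfiguration
      .setBoolean("parquet.filter.record-level.enabled", false)

    // The pushed predicate is still handed to the reader, but with the flag
    // above it would no longer be re-applied record by record inside
    // InternalParquetRecordReader.
    val reader = pushed match {
      case Some(filter) =>
        new ParquetRecordReader[UnsafeRow](
          new ParquetReadSupport,
          FilterCompat.get(filter, null))
      case _ =>
        new ParquetRecordReader[UnsafeRow](new ParquetReadSupport)
    }
    ```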

