EmilyMatt commented on code in PR #23245:
URL: https://github.com/apache/datafusion/pull/23245#discussion_r3493779651
##########
datafusion/datasource-parquet/src/bloom_filter.rs:
##########
@@ -97,7 +97,12 @@ impl BloomFilterStatistics {
| ScalarValue::BinaryView(Some(v))
| ScalarValue::LargeBinary(Some(v)) => sbbf.check(v),
ScalarValue::FixedSizeBinary(_size, Some(v)) => sbbf.check(v),
- ScalarValue::Boolean(Some(v)) => sbbf.check(v),
+ // Parquet Java doesn't update boolean values in a Bloom filter,
+ // which results in a empty filter, for which `sbbf.check()`
always returns `false`.
+ // See
https://github.com/apache/parquet-java/blob/52c0a5e8c5ff7680cc299ce5aad60acef3a9054d/parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnValueCollector.java#L75
+ // In order to correctly read such files with SBBF, we have to
skip the check and return `true`
+ // because values may be present in the data file even if they are
not in the filter.
+ ScalarValue::Boolean(Some(_)) => true,
Review Comment:
Please make this configurable
While I think this is worthwhile(especially for Spark/Hadoop parquets
usecases), we definitely want to support boolean values here, probably even by
default
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]