tustvold commented on issue #5108: URL: https://github.com/apache/arrow-rs/issues/5108#issuecomment-1823238517
> Do you know where this is specified

The Bloom filter specification is [here](https://github.com/apache/parquet-format/blob/master/BloomFilter.md). As you note, it never explicitly states what the Bloom filter contains; however, it does state:

> In their current format, column statistics and dictionaries can be used for predicate pushdown. Statistics include minimum and maximum value, which can be used to filter out values not in the range. Dictionaries are more specific, and readers can filter out values that are between min and max but not in the dictionary. However, when there are too many distinct values, writers sometimes choose not to add dictionaries because of the extra space they occupy. This leaves columns with large cardinalities and widely separated min and max without support for predicate pushdown.

The implication is that the Bloom filter contains the same value data as is used in statistics and dictionaries. This is also borne out by the implementation in parquet-mr: https://github.com/apache/parquet-mr/blob/452c94d20abda0a83101d00f3b697e110d744942/parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnWriterBase.java#L210
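To make the implication concrete, here is a rough Python sketch of the three pruning structures the quoted passage compares, all built from the same column values. Everything in it is hypothetical and for illustration only: `ToyBloomFilter`, `summarize_column_chunk`, and the `can_skip_with_*` helpers are made-up names, and the filter is a plain bit array keyed by SHA-256, not the split-block layout or hash function the Parquet specification defines.

```python
import hashlib


class ToyBloomFilter:
    """Plain bit-array Bloom filter (assumption: NOT the Parquet split-block format)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit set

    def _positions(self, value):
        # Derive num_hashes bit positions from salted SHA-256 digests of the value.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value!r}".encode()).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_bits

    def insert(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


def summarize_column_chunk(values):
    """Build min/max statistics, a dictionary, and a Bloom filter from the same values."""
    bloom = ToyBloomFilter()
    for v in values:
        bloom.insert(v)
    return {
        "min": min(values),
        "max": max(values),
        "dictionary": set(values),
        "bloom": bloom,
    }


def can_skip_with_stats(summary, probe):
    # Statistics only rule out probes outside the [min, max] range.
    return probe < summary["min"] or probe > summary["max"]


def can_skip_with_dictionary(summary, probe):
    # A dictionary is exact, but writers may omit it for high-cardinality columns.
    return probe not in summary["dictionary"]


def can_skip_with_bloom(summary, probe):
    # Never skips a chunk that holds the value; may fail to skip on a false positive.
    return not summary["bloom"].might_contain(probe)


if __name__ == "__main__":
    summary = summarize_column_chunk([1, 5, 1_000_000])
    for probe in (5, 500, 2_000_000):
        print(
            probe,
            can_skip_with_stats(summary, probe),
            can_skip_with_dictionary(summary, probe),
            can_skip_with_bloom(summary, probe),
        )
```

For a probe such as `500`, which lies between the widely separated min and max, the statistics cannot prune the chunk, while the dictionary can. A filter this sparse will very likely prune it too, though a false positive is always possible. That comparison only makes sense if the filter is populated from the same values as the other two structures, which is the reading suggested above.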
