tustvold commented on issue #5108:
URL: https://github.com/apache/arrow-rs/issues/5108#issuecomment-1823238517

   > Do you know where this is specified
   
   The Bloom filter specification is 
[here](https://github.com/apache/parquet-format/blob/master/BloomFilter.md). As 
you note, it never explicitly states what the Bloom filter contains. However, it 
does state:
   
   > In their current format, column statistics and dictionaries can be used 
for predicate pushdown. Statistics include minimum and maximum value, which can 
be used to filter out values not in the range. Dictionaries are more specific, 
and readers can filter out values that are between min and max but not in the 
dictionary. However, when there are too many distinct values, writers sometimes 
choose not to add dictionaries because of the extra space they occupy. This 
leaves columns with large cardinalities and widely separated min and max 
without support for predicate pushdown.
   
   The implication is that the Bloom filter contains the same values as are 
used in statistics and dictionaries.
   
   This is also borne out by the implementation in parquet-mr - 
https://github.com/apache/parquet-mr/blob/452c94d20abda0a83101d00f3b697e110d744942/parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnWriterBase.java#L210
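
   As a rough illustration of the point in the quoted paragraph, here is a minimal sketch in which min/max statistics and a Bloom filter are built from the same column values. It is not taken from the spec, parquet-mr, or arrow-rs: `ToyBloom`, `ChunkMeta`, `build_meta`, `stats_may_match` and `bloom_may_match` are hypothetical names, and the filter is a plain bit-vector filter using `std`'s `DefaultHasher`, not the split-block filter with xxHash that the Parquet spec describes.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Toy Bloom filter (assumption: NOT the Parquet split-block layout).
struct ToyBloom {
    bits: Vec<bool>,
    num_hashes: u64,
}

impl ToyBloom {
    fn new(num_bits: usize, num_hashes: u64) -> Self {
        Self { bits: vec![false; num_bits], num_hashes }
    }

    /// Derive the bit position for `value` under the given seed.
    fn index(&self, value: i64, seed: u64) -> usize {
        let mut hasher = DefaultHasher::new();
        seed.hash(&mut hasher);
        value.hash(&mut hasher);
        (hasher.finish() % self.bits.len() as u64) as usize
    }

    fn insert(&mut self, value: i64) {
        for seed in 0..self.num_hashes {
            let i = self.index(value, seed);
            self.bits[i] = true;
        }
    }

    /// `false` means definitely absent; `true` means possibly present.
    fn might_contain(&self, value: i64) -> bool {
        (0..self.num_hashes).all(|seed| self.bits[self.index(value, seed)])
    }
}

/// Hypothetical per-chunk metadata: every field is derived from the
/// same set of column values.
struct ChunkMeta {
    min: i64,
    max: i64,
    bloom: ToyBloom,
}

/// Build statistics and the filter from one non-empty slice of values.
fn build_meta(values: &[i64]) -> ChunkMeta {
    let min = values.iter().copied().min().expect("non-empty chunk");
    let max = values.iter().copied().max().expect("non-empty chunk");
    let mut bloom = ToyBloom::new(1024, 3);
    for &v in values {
        bloom.insert(v);
    }
    ChunkMeta { min, max, bloom }
}

/// Min/max pruning: can only rule out probes outside [min, max].
fn stats_may_match(meta: &ChunkMeta, probe: i64) -> bool {
    meta.min <= probe && probe <= meta.max
}

/// Bloom pruning: can also rule out most in-range probes that were
/// never written, at the cost of occasional false positives.
fn bloom_may_match(meta: &ChunkMeta, probe: i64) -> bool {
    meta.bloom.might_contain(probe)
}
```

   For a chunk built with `build_meta(&[10, 500, 1_000_000])` and a probe of `42`, `stats_may_match` returns `true` because 42 lies between min and max, so statistics alone cannot skip the chunk. `bloom_may_match` will most likely return `false` for that probe, which is the "widely separated min and max" case the spec describes. This only works if the filter is populated with the same values that the statistics are computed from.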

