[ https://issues.apache.org/jira/browse/PARQUET-2256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17699694#comment-17699694 ]

Gang Wu commented on PARQUET-2256:
----------------------------------

Apache ORC supports compression of bloom filters. It would be nice if we could do something similar. However, I think there is a prerequisite (or at least a highly relevant issue): https://issues.apache.org/jira/browse/PARQUET-2257

> Adding Compression for BloomFilter
> ----------------------------------
>
>                 Key: PARQUET-2256
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2256
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-format
>    Affects Versions: format-2.9.0
>            Reporter: Xuwei Fu
>            Priority: Major
>
> In current Parquet implementations, if the ndv (number of distinct values) is
> not set for a BloomFilter, most implementations assume an ndv of 1M and size
> the filter from that and the fpp (false positive probability). So, if the fpp
> is 0.01, the BloomFilter may grow to 2M for each column, which is really
> huge. Should we support compression for BloomFilter, like:
>
> ```
> /**
>  * The compression used in the Bloom filter.
>  **/
> struct Uncompressed {}
> union BloomFilterCompression {
>   1: Uncompressed UNCOMPRESSED;
> +2: CompressionCodec COMPRESSION;
> }
> ```

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
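
On the 2M figure in the issue description: the arithmetic can be checked with a minimal sketch. It assumes the split-block Bloom filter sizing formula, bits = -8 * ndv / ln(1 - fpp^(1/8)), and assumes the result is rounded up to a power of two, as implementations commonly do. The function name `sbbf_num_bytes` is hypothetical and is not taken from any Parquet library.

```python
import math


def sbbf_num_bytes(ndv: int, fpp: float) -> int:
    """Estimate the Bloom filter size in bytes for `ndv` distinct values.

    Assumption: split-block Bloom filter sizing,
    bits = -8 * ndv / ln(1 - fpp ** (1/8)).
    """
    bits = -8.0 * ndv / math.log(1.0 - fpp ** (1.0 / 8.0))
    # Assumption: round the bit count up to the next power of two.
    needed = math.ceil(bits)
    rounded_bits = 1 << (needed - 1).bit_length()
    return rounded_bits // 8


if __name__ == "__main__":
    size = sbbf_num_bytes(1_000_000, 0.01)
    print(f"{size} bytes = {size / 2**20:.0f} MiB")  # 2097152 bytes = 2 MiB
```

Under these assumptions the unrounded requirement is about 9.7 million bits (roughly 1.2 MB), and the power-of-two rounding is what brings it to 2 MiB per column.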