[ https://issues.apache.org/jira/browse/PARQUET-2256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17701637#comment-17701637 ]
ASF GitHub Bot commented on PARQUET-2256:
-----------------------------------------

gszadovszky commented on PR #195:
URL: https://github.com/apache/parquet-format/pull/195#issuecomment-1473613246

   @mapleFU, I have discovered two unfortunate issues with the format definition of bloom filters that would be nice to correct before adding this change. (I am also fine with solving these inside this PR.)

   * We should not copy-paste parts of the thrift file into the documentation. Why would we have them in two places? I would suggest only referencing the related thrift part from the bloom filter spec, or simply removing the related part.
   * Since `CompressionCodec` already has a value of `UNCOMPRESSED`, the enum `BloomFilterCompression` looks weird. I do not have a much better solution for this since we must keep backward compatibility. What do you think about renaming `COMPRESSED` to `COMPRESSION_CODEC` and deprecating `BloomFilterCompression.UNCOMPRESSED` with a note to use `COMPRESSION_CODEC = UNCOMPRESSED` instead? (Of course, from the reader implementation's point of view we need to handle both.)

> Adding Compression for BloomFilter
> ----------------------------------
>
>                 Key: PARQUET-2256
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2256
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-format
>   Affects Versions: format-2.9.0
>            Reporter: Xuwei Fu
>            Assignee: Xuwei Fu
>            Priority: Major
>
> In current Parquet implementations, if the ndv is not set for a BloomFilter,
> most implementations assume an ndv of 1M and size the filter from it and the
> fpp. So, if fpp is 0.01, the BloomFilter size may grow to 2M for each column,
> which is really huge. Should we support compression for BloomFilter, like:
>
> ```
> /**
>  * The compression used in the Bloom filter.
>  **/
> struct Uncompressed {}
> union BloomFilterCompression {
>   1: Uncompressed UNCOMPRESSED;
> +2: CompressionCodec COMPRESSION;
> }
> ```

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
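
As a rough check of the size estimate in the quoted issue description (ndv of 1M and fpp of 0.01 giving about 2M per column), here is a minimal Python sketch. It assumes the split-block sizing formula of roughly `-8 / ln(1 - fpp^(1/8))` bits per distinct value, and assumes the implementation rounds the byte size up to a power of two. Both are assumptions about typical implementations, not something stated in this thread, and the function names are hypothetical.

```python
import math


def optimal_num_bits(ndv: int, fpp: float) -> float:
    """Estimated bits for a split-block Bloom filter (assumed formula)."""
    return -8.0 * ndv / math.log(1.0 - fpp ** (1.0 / 8.0))


def next_power_of_two(n: int) -> int:
    """Smallest power of two that is >= n (for n >= 1)."""
    return 1 if n <= 1 else 1 << (n - 1).bit_length()


def bloom_filter_bytes(ndv: int, fpp: float) -> int:
    """Estimated filter size in bytes, assuming power-of-two rounding."""
    num_bytes = math.ceil(optimal_num_bits(ndv, fpp) / 8)
    return next_power_of_two(num_bytes)


if __name__ == "__main__":
    size = bloom_filter_bytes(1_000_000, 0.01)
    print(f"{size} bytes ({size / (1024 * 1024):.1f} MiB)")
```

Under these assumptions the unrounded size is about 1.2 MB, and the power-of-two rounding is what brings it to 2 MiB.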