[ https://issues.apache.org/jira/browse/PARQUET-2256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17708215#comment-17708215 ]
ASF GitHub Bot commented on PARQUET-2256:
-----------------------------------------

mapleFU commented on PR #195:
URL: https://github.com/apache/parquet-format/pull/195#issuecomment-1495352894

Hi @gszadovszky @wgtmac. I have been busy these past weeks, so I am a bit late here. Although https://issues.apache.org/jira/browse/HUDI-558 reports that compressing the BloomFilter works well, after running some experiments I found that:

1. When the BloomFilter is "sparse" (i.e. the filter is large but only a small number of values have been inserted), compression can work well.
2. When the BloomFilter is sized to match the actual ndv, zstd, lz4, and snappy barely help.

All lengths below are in bytes:

```
size     fpp     hash ndv   uncompressed   zstd      lz4       snappy
100      0.1     100        144            153       141       143
1000     0.1     1000       1040           1050      1040      1040
10000    0.1     10000      8209           7314      8238      8209
100000   0.1     100000     131089         131104    131599    131094
500000   0.1     500000     524305         506198    526357    524328
100      0.05    100        144            153       141       143
1000     0.05    1000       1040           1050      1040      1040
10000    0.05    10000      16401          16411     16462     16402
100000   0.05    100000     131089         131104    131599    131094
500000   0.05    500000     524305         506198    526357    524328
100      0.01    100        144            153       141       143
1000     0.01    1000       2064           2074      2068      2064
10000    0.01    10000      16401          16411     16462     16402
100000   0.01    100000     131089         131104    131599    131094
500000   0.01    500000     1048594        1007916   1052703   1048641
100      0.005   100        272            282       269       272
1000     0.005   1000       2064           2074      2068      2064
10000    0.005   10000      16401          16411     16462     16402
100000   0.005   100000     262161         238073    263185    262172
500000   0.005   500000     1048594        1007916   1052703   1048641
```
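A minimal sketch of this kind of experiment (not the exact benchmark above): it approximates a split-block filter's bit density by setting 8 random bits per insert inside one 32-byte block, then compresses the raw bitset. `java.util.zip.Deflater` stands in for zstd/lz4/snappy, which would need third-party bindings, and all class and method names here are illustrative:

```java
import java.util.Random;
import java.util.zip.Deflater;

public class BloomCompressSketch {

  // Simulate a split-block Bloom filter's bitset: each insert sets one
  // random bit in each of the 8 words of one 32-byte block. This only
  // approximates the bit density of Parquet's SBBF, not its hashing.
  static byte[] buildFilter(int numBytes, int ndv, long seed) {
    byte[] bits = new byte[numBytes];
    int numBlocks = numBytes / 32;
    Random rnd = new Random(seed);
    for (int i = 0; i < ndv; i++) {
      int block = rnd.nextInt(numBlocks);
      for (int word = 0; word < 8; word++) {
        int bit = rnd.nextInt(32);                     // one bit per 32-bit word
        int byteIdx = block * 32 + word * 4 + bit / 8;
        bits[byteIdx] |= 1 << (bit % 8);
      }
    }
    return bits;
  }

  // Compress with DEFLATE and return only the compressed length.
  static int deflatedSize(byte[] input) {
    Deflater d = new Deflater(Deflater.BEST_COMPRESSION);
    d.setInput(input);
    d.finish();
    byte[] buf = new byte[4096];
    int total = 0;
    while (!d.finished()) {
      total += d.deflate(buf);
    }
    d.end();
    return total;
  }

  public static void main(String[] args) {
    int ndv = 10_000;
    // 16 KiB is roughly right for ndv=10k at fpp=0.01 (see the table);
    // 128 KiB models an 8x overestimated ndv.
    for (int numBytes : new int[] {16 * 1024, 128 * 1024}) {
      byte[] filter = buildFilter(numBytes, ndv, 42L);
      System.out.printf("filter=%d bytes, ndv=%d, deflate=%d bytes%n",
          numBytes, ndv, deflatedSize(filter));
    }
  }
}
```

With the well-sized 16 KiB filter the deflated output should be about as large as the input, while the 8x oversized one should compress substantially, matching the pattern in the table above.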
So I think compressing the SBBF (split-block Bloom filter) is probably not a good idea when the ndv estimate is accurate: a well-sized filter has roughly half of its bits set, so the bitmap is near maximum entropy and a general-purpose codec has little to remove. Only when the size estimate is badly off (e.g. the allocated filter is 8 times larger than the real hash-value count requires) does the compressed size drop to about 50% of the original for zstd and lz4.

> Adding Compression for BloomFilter
> ----------------------------------
>
>                 Key: PARQUET-2256
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2256
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-format
>    Affects Versions: format-2.9.0
>            Reporter: Xuwei Fu
>            Assignee: Xuwei Fu
>            Priority: Major
>
> In current Parquet implementations, if the BloomFilter's ndv is not set, most implementations guess 1M as the ndv and use it together with the fpp. So if fpp is 0.01, the BloomFilter may grow to 2M per column, which is really huge. Should we support compression for the BloomFilter, like:
>
> ```
> /**
>  * The compression used in the Bloom filter.
>  **/
> struct Uncompressed {}
>
> union BloomFilterCompression {
>   1: Uncompressed UNCOMPRESSED;
> +2: CompressionCodec COMPRESSION;
> }
> ```
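As a sanity check on the quoted "2M per column" figure: the standard sizing formula m = -n * ln(p) / (ln 2)^2 with the guessed ndv of 1M and fpp of 0.01 gives about 1.2 MB, which lands at 2 MiB once rounded up to a power of two. A short sketch of that arithmetic (the power-of-two rounding is an assumption about how implementations size the bitset, modeled on parquet-mr's BlockSplitBloomFilter; the class name is illustrative):

```java
public class BloomSizeCheck {
  public static void main(String[] args) {
    long ndv = 1_000_000;   // the guessed default ndv
    double fpp = 0.01;
    // Optimal Bloom filter size: m = -n * ln(p) / (ln 2)^2 bits.
    double bits = -ndv * Math.log(fpp) / (Math.log(2) * Math.log(2));
    long bytes = (long) Math.ceil(bits / 8);            // ~1.2 MB
    // Round up to the next power of two, as the bitset allocation
    // is assumed to do; 1.2 MB becomes 2 MiB.
    long rounded = Long.highestOneBit(bytes - 1) << 1;
    System.out.printf("optimal=%d bytes, allocated=%d bytes%n", bytes, rounded);
  }
}
```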