[
https://issues.apache.org/jira/browse/PARQUET-2256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17708215#comment-17708215
]
ASF GitHub Bot commented on PARQUET-2256:
-----------------------------------------
mapleFU commented on PR #195:
URL: https://github.com/apache/parquet-format/pull/195#issuecomment-1495352894
Hi @gszadovszky @wgtmac. I was busy these weeks, so I'm a bit late here.
Though https://issues.apache.org/jira/browse/HUDI-558 says that compressing
the BloomFilter works well, after running some experiments I found that:
1. When the BloomFilter is "sparse" (the filter is large but only lightly
filled), compression may work well.
2. When the BloomFilter size matches the ndv, zstd, lz4, and snappy don't
help.
```
size     fpp    Hash Ndv  before compress  zstd     lz4      snappy
100      0.1    100       144              153      141      143
1000     0.1    1000      1040             1050     1040     1040
10000    0.1    10000     8209             7314     8238     8209
100000   0.1    100000    131089           131104   131599   131094
500000   0.1    500000    524305           506198   526357   524328
100      0.05   100       144              153      141      143
1000     0.05   1000      1040             1050     1040     1040
10000    0.05   10000     16401            16411    16462    16402
100000   0.05   100000    131089           131104   131599   131094
500000   0.05   500000    524305           506198   526357   524328
100      0.01   100       144              153      141      143
1000     0.01   1000      2064             2074     2068     2064
10000    0.01   10000     16401            16411    16462    16402
100000   0.01   100000    131089           131104   131599   131094
500000   0.01   500000    1048594          1007916  1052703  1048641
100      0.005  100       272              282      269      272
1000     0.005  1000      2064             2074     2068     2064
10000    0.005  10000     16401            16411    16462    16402
100000   0.005  100000    262161           238073   263185   262172
500000   0.005  500000    1048594          1007916  1052703  1048641
```
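For context, the filter sizes in the experiment follow from Parquet's SBBF
sizing formula. A minimal sketch, assuming the formula and
round-up-to-power-of-two behavior used by parquet-mr's
`BlockSplitBloomFilter` (the function name here is my own):

```python
import math

def sbbf_num_bytes(ndv, fpp):
    """Bitset size for a split-block Bloom filter.

    Assumes the sizing rule from parquet-mr's BlockSplitBloomFilter:
    bits = -8 * ndv / ln(1 - fpp^(1/8)), rounded up to a power of two.
    """
    bits = -8 * ndv / math.log(1 - fpp ** (1.0 / 8))
    return 2 ** math.ceil(math.log2(bits)) // 8

# With the default guess of ndv = 1M and fpp = 0.01, the bitset comes out
# at 2 MiB per column, matching the size mentioned in the issue description.
print(sbbf_num_bytes(1_000_000, 0.01))  # 2097152
```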
So I think compressing the SBBF is not a good idea when the ndv estimation is
accurate. Only when the size estimation gets much worse (e.g. the estimated
size is 8 times larger than the real number of hashed values requires) does
the compression ratio approach 50% for zstd and lz4.
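The density effect behind these results can be sketched with a small
simulation. This is a classic Bloom filter approximation (8 random bits per
insert), not Parquet's exact SBBF block layout, and it uses stdlib `zlib` as
a stand-in for zstd/lz4/snappy, whose Python bindings are third-party; the
bit density, and therefore the compressibility trend, behaves the same way:

```python
import random
import zlib

random.seed(42)

def simulated_filter(num_bytes, num_insertions):
    """Set 8 pseudo-random bits per inserted value (Bloom-filter-like)."""
    bits = bytearray(num_bytes)
    for _ in range(num_insertions):
        for _ in range(8):
            pos = random.randrange(num_bytes * 8)
            bits[pos // 8] |= 1 << (pos % 8)
    return bytes(bits)

def ratio(data):
    """Compressed size over original size; ~1.0 means incompressible."""
    return len(zlib.compress(data)) / len(data)

# Well-sized filter: roughly 60% of the bits end up set, which looks like
# random noise to a compressor.
dense = simulated_filter(1024, 1000)
# 8x over-sized filter: only ~11% of the bits are set, so long runs of
# zero bits remain and compression pays off.
sparse = simulated_filter(8 * 1024, 1000)

print("dense ratio: ", round(ratio(dense), 2))
print("sparse ratio:", round(ratio(sparse), 2))
```

The dense filter stays near ratio 1.0 while the over-sized one compresses
substantially, mirroring the pattern in the measurements above.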
> Adding Compression for BloomFilter
> ----------------------------------
>
> Key: PARQUET-2256
> URL: https://issues.apache.org/jira/browse/PARQUET-2256
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-format
> Affects Versions: format-2.9.0
> Reporter: Xuwei Fu
> Assignee: Xuwei Fu
> Priority: Major
>
> In current Parquet implementations, if the BloomFilter ndv is not set, most
> implementations will guess 1M as the ndv and use it together with the fpp. So
> if the fpp is 0.01, the BloomFilter may grow to 2M per column, which is really
> huge. Should we support compression for the BloomFilter, like:
>
> ```
> /**
>  * The compression used in the Bloom filter.
>  **/
> struct Uncompressed {}
>
> union BloomFilterCompression {
>   1: Uncompressed UNCOMPRESSED;
>   2: CompressionCodec COMPRESSION;
> }
> ```
--
This message was sent by Atlassian Jira
(v8.20.10#820010)