[ https://issues.apache.org/jira/browse/PARQUET-2256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17708215#comment-17708215 ]

ASF GitHub Bot commented on PARQUET-2256:
-----------------------------------------

mapleFU commented on PR #195:
URL: https://github.com/apache/parquet-format/pull/195#issuecomment-1495352894

   Hi @gszadovszky @wgtmac . I have been busy these past weeks, so I'm a bit late here.
   
   Although https://issues.apache.org/jira/browse/HUDI-558 says that 
compressing the BloomFilter works well, after running some experiments I found 
that:
   1. When the BloomFilter is "sparse" ( i.e. the filter is large but only 
lightly filled ), compression may work well.
   2. When the number of hashed values is close to the ndv the filter was 
sized for, zstd, lz4 and snappy don't help.
   
   ```
   size: 100, fpp: 0.1
   Hash Ndv: 100
   Size before compress: 144 zstd compressed length:153
   lz4 compressed length:141
   snappy compressed length:143
   
   
   size: 1000, fpp: 0.1
   Hash Ndv: 1000
   Size before compress: 1040 zstd compressed length:1050
   lz4 compressed length:1040
   snappy compressed length:1040
   
   
   size: 10000, fpp: 0.1
   Hash Ndv: 10000
   Size before compress: 8209 zstd compressed length:7314
   lz4 compressed length:8238
   snappy compressed length:8209
   
   
   size: 100000, fpp: 0.1
   Hash Ndv: 100000
   Size before compress: 131089 zstd compressed length:131104
   lz4 compressed length:131599
   snappy compressed length:131094
   
   
   size: 500000, fpp: 0.1
   Hash Ndv: 500000
   Size before compress: 524305 zstd compressed length:506198
   lz4 compressed length:526357
   snappy compressed length:524328
   
   
   size: 100, fpp: 0.05
   Hash Ndv: 100
   Size before compress: 144 zstd compressed length:153
   lz4 compressed length:141
   snappy compressed length:143
   
   
   size: 1000, fpp: 0.05
   Hash Ndv: 1000
   Size before compress: 1040 zstd compressed length:1050
   lz4 compressed length:1040
   snappy compressed length:1040
   
   
   size: 10000, fpp: 0.05
   Hash Ndv: 10000
   Size before compress: 16401 zstd compressed length:16411
   lz4 compressed length:16462
   snappy compressed length:16402
   
   
   size: 100000, fpp: 0.05
   Hash Ndv: 100000
   Size before compress: 131089 zstd compressed length:131104
   lz4 compressed length:131599
   snappy compressed length:131094
   
   
   size: 500000, fpp: 0.05
   Hash Ndv: 500000
   Size before compress: 524305 zstd compressed length:506198
   lz4 compressed length:526357
   snappy compressed length:524328
   
   
   size: 100, fpp: 0.01
   Hash Ndv: 100
   Size before compress: 144 zstd compressed length:153
   lz4 compressed length:141
   snappy compressed length:143
   
   
   size: 1000, fpp: 0.01
   Hash Ndv: 1000
   Size before compress: 2064 zstd compressed length:2074
   lz4 compressed length:2068
   snappy compressed length:2064
   
   
   size: 10000, fpp: 0.01
   Hash Ndv: 10000
   Size before compress: 16401 zstd compressed length:16411
   lz4 compressed length:16462
   snappy compressed length:16402
   
   
   size: 100000, fpp: 0.01
   Hash Ndv: 100000
   Size before compress: 131089 zstd compressed length:131104
   lz4 compressed length:131599
   snappy compressed length:131094
   
   
   size: 500000, fpp: 0.01
   Hash Ndv: 500000
   Size before compress: 1048594 zstd compressed length:1007916
   lz4 compressed length:1052703
   snappy compressed length:1048641
   
   
   size: 100, fpp: 0.005
   Hash Ndv: 100
   Size before compress: 272 zstd compressed length:282
   lz4 compressed length:269
   snappy compressed length:272
   
   
   size: 1000, fpp: 0.005
   Hash Ndv: 1000
   Size before compress: 2064 zstd compressed length:2074
   lz4 compressed length:2068
   snappy compressed length:2064
   
   size: 10000, fpp: 0.005
   Hash Ndv: 10000
   Size before compress: 16401 zstd compressed length:16411
   lz4 compressed length:16462
   snappy compressed length:16402
   
   size: 100000, fpp: 0.005
   Hash Ndv: 100000
   Size before compress: 262161 zstd compressed length:238073
   lz4 compressed length:263185
   snappy compressed length:262172
   
   size: 500000, fpp: 0.005
   Hash Ndv: 500000
   Size before compress: 1048594 zstd compressed length:1007916
   lz4 compressed length:1052703
   snappy compressed length:1048641
   ```
    
   So I think compressing the split-block Bloom filter (SBBF) is probably not 
a good idea when the ndv estimate is accurate. Only when the size estimate gets 
much worse ( e.g. the estimated size is 8 times larger than the real 
hash-value size ) does the compression ratio approach 50% for zstd and lz4.
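The effect above can be reproduced with a toy sketch (not the real SBBF block layout; the `filter_bytes` helper and the 64 KiB / 512 KiB sizes are made up for illustration): set eight pseudo-random bits per distinct value and compress the resulting bitset. A correctly sized filter ends up with roughly half its bits set and is near maximum entropy, while an 8x oversized one is sparse and compresses well:

```python
import random
import zlib

def filter_bytes(num_bytes: int, ndv: int) -> bytes:
    """Toy stand-in for a Bloom filter bitset (hypothetical, illustration
    only): sets 8 pseudo-random bit positions per distinct value,
    mimicking the eight hash functions of a split-block filter."""
    rng = random.Random(42)  # fixed seed for reproducibility
    bits = bytearray(num_bytes)
    for _ in range(ndv * 8):
        pos = rng.randrange(num_bytes * 8)
        bits[pos // 8] |= 1 << (pos % 8)
    return bytes(bits)

# 50k distinct values in a 64 KiB filter: roughly half the bits end up
# set, so the bitset is near maximum entropy and codecs cannot help.
well_sized = filter_bytes(64 * 1024, ndv=50_000)
# The same values in a filter 8x too large: ~9% of bits set, compressible.
oversized = filter_bytes(512 * 1024, ndv=50_000)

for name, buf in (("well-sized", well_sized), ("8x oversized", oversized)):
    ratio = len(zlib.compress(buf, 9)) / len(buf)
    print(f"{name}: zlib compresses to {ratio:.0%} of original")
```

zlib stands in here for zstd/lz4/snappy; the same entropy argument applies to any general-purpose codec.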




> Adding Compression for BloomFilter
> ----------------------------------
>
>                 Key: PARQUET-2256
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2256
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-format
>    Affects Versions: format-2.9.0
>            Reporter: Xuwei Fu
>            Assignee: Xuwei Fu
>            Priority: Major
>
> In current Parquet implementations, if the BloomFilter's ndv is not set, most 
> implementations will guess 1M as the ndv and use it together with the fpp. So, 
> if the fpp is 0.01, the BloomFilter size may grow to 2M for each column, which 
> is really huge. Should we support compression for the BloomFilter, like:
>  
> ```
> struct Uncompressed {}
>
> /**
>  * The compression used in the Bloom filter.
>  **/
> union BloomFilterCompression {
>   1: Uncompressed UNCOMPRESSED;
> +2: CompressionCodec COMPRESSION;
> }
> ```
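For context on where the 2M figure comes from: the byte counts in the experiment above are each a power of two plus a small header, which is consistent with split-block sizing (eight hash functions, so bits = -8·ndv / ln(1 − fpp^(1/8)), rounded up to the next power of two). A sketch under that assumption, with a hypothetical `sbbf_num_bytes` helper:

```python
import math

def sbbf_num_bytes(ndv: int, fpp: float) -> int:
    """Estimated split-block Bloom filter bitset size (assumption: k = 8
    hashes, bits = -8*ndv / ln(1 - fpp**(1/8)), rounded up to the next
    power of two as the block-split layout requires)."""
    bits = -8.0 * ndv / math.log(1.0 - fpp ** 0.125)
    nbytes = math.ceil(bits / 8.0)
    return 1 << (nbytes - 1).bit_length()  # next power of two

# The default 1M-ndv guess at fpp = 0.01 lands on a 2 MiB filter per column.
print(sbbf_num_bytes(1_000_000, 0.01))  # 2097152 bytes = 2 MiB
# Cross-check against the experiment: ndv = 500000, fpp = 0.01 reported
# 1048594 bytes, i.e. a 1048576-byte bitset plus the header.
print(sbbf_num_bytes(500_000, 0.01))    # 1048576
```

The power-of-two rounding is also why an accurate ndv estimate matters more than compression: halving the estimate can halve the filter outright.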



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
