[ 
https://issues.apache.org/jira/browse/PARQUET-2256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17709515#comment-17709515
 ] 

ASF GitHub Bot commented on PARQUET-2256:
-----------------------------------------

emkornfield commented on PR #195:
URL: https://github.com/apache/parquet-format/pull/195#issuecomment-1499711125

   > That's a good question; the bytes do include the header. But I think the
   > compression is powerful enough that including a header should not affect
   > the compression ratio greatly.
   
   Makes sense, I couldn't come up with anything better here. I think it is 
probably OK to still have compression even if it doesn't help for a large NDV; 
it seems writers can decide whether to write compressed vs. uncompressed based 
on the final size.




> Adding Compression for BloomFilter
> ----------------------------------
>
>                 Key: PARQUET-2256
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2256
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-format
>    Affects Versions: format-2.9.0
>            Reporter: Xuwei Fu
>            Assignee: Xuwei Fu
>            Priority: Major
>
> In current Parquet implementations, if the BloomFilter's NDV is not set, most 
> implementations will guess 1M as the NDV and use it together with the FPP. So, 
> if the FPP is 0.01, the BloomFilter size may grow to about 2 MB per column, 
> which is really huge. Should we support compression for the BloomFilter, like:
>  
> ```
>  /**
>  * The compression used in the Bloom filter.
>  **/
> struct Uncompressed {}
> union BloomFilterCompression {
>   1: Uncompressed UNCOMPRESSED;
>   2: CompressionCodec COMPRESSION;
> }
> ```
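To see why the guessed NDV produces MB-scale filters, the classic Bloom filter sizing formula m = -n * ln(p) / (ln 2)^2 already gives roughly 1.2 MB for n = 1M and p = 0.01; Parquet's split-block Bloom filters need somewhat more space for the same FPP, which is consistent with the ~2 MB figure in the description. A quick check (the formula is standard; `bloom_bits` is an illustrative name):

```python
import math


def bloom_bits(ndv: int, fpp: float) -> int:
    """Classic Bloom filter sizing: m = -n * ln(p) / (ln 2)^2 bits."""
    return math.ceil(-ndv * math.log(fpp) / (math.log(2) ** 2))


bits = bloom_bits(1_000_000, 0.01)
megabytes = bits / 8 / 1_000_000  # roughly 1.2 MB for the classic layout
```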



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
