[ 
https://issues.apache.org/jira/browse/PARQUET-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16016790#comment-16016790
 ] 

Junjie Chen commented on PARQUET-41:
------------------------------------

Hi [~rdblue],
The number of distinct values in a column only ever increases over the whole 
data set. What matters more is the number of distinct values within a window 
such as a row group or a page. Take a telecom company as an example: it 
produces about one row group (256MB) every minute, and almost no records within 
that window are repeated. A smaller window may also contain fewer repeated 
values, but it needs more metadata overall.
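
To make the window argument concrete, here is a minimal sketch that counts 
distinct values per fixed-size window. The function name and the toy data are 
hypothetical, and a window here is just a slice of an in-memory list, not a 
real row group or page.

```python
def distinct_per_window(values, window_size):
    """Count distinct values in each consecutive window of `window_size` items.

    Hypothetical helper for illustration: a "window" stands in for a row
    group or page, and `values` stands in for one column's data.
    """
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    counts = []
    for start in range(0, len(values), window_size):
        window = values[start:start + window_size]
        counts.append(len(set(window)))
    return counts


# Toy column: 4 distinct values overall, 8 records.
column = [1, 2, 3, 1, 2, 3, 4, 4]
per_large_window = distinct_per_window(column, 4)
per_small_window = distinct_per_window(column, 2)
```

Each window would carry its own filter, so halving the window size doubles the 
number of filters to store even though each one covers fewer values.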

As for the effect of the bloom filter (BF), it depends on how much of a query's 
time is spent on the HDFS scan, in other words on the data scale. With one day 
of telecom data on an 8-node cluster, a query takes about 5-6 minutes without 
BF and 10+ seconds with BF.

As for optimization, I agree with your point that a bloom filter becomes 
useless with the wrong configuration. Users need to understand their data 
clearly and set the correct parameters. In the future we should also consider 
setting the bloom filter parameters dynamically, for example based on sampling, 
or changing them at run time. For now, a 'static' BF should be a good option 
for users who know their data.
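
As a rough illustration of why the configuration matters, the sketch below 
uses the textbook formulas for a classic bloom filter, which are not 
necessarily what the pull request implements. The function names and the 
example numbers are assumptions for illustration only.

```python
import math


def bloom_parameters(ndv, fpp):
    """Return (num_bits, num_hashes) for a classic bloom filter.

    Textbook sizing, assumed here for illustration:
      num_bits   = -ndv * ln(fpp) / (ln 2)^2
      num_hashes = (num_bits / ndv) * ln 2
    `ndv` is the expected number of distinct values in the window and
    `fpp` is the target false positive probability.
    """
    if ndv <= 0 or not 0.0 < fpp < 1.0:
        raise ValueError("ndv must be positive and fpp must be in (0, 1)")
    num_bits = math.ceil(-ndv * math.log(fpp) / (math.log(2) ** 2))
    num_hashes = max(1, round(num_bits / ndv * math.log(2)))
    return num_bits, num_hashes


def expected_fpp(num_bits, num_hashes, actual_ndv):
    """Approximate false positive rate once `actual_ndv` values are inserted."""
    return (1.0 - math.exp(-num_hashes * actual_ndv / num_bits)) ** num_hashes


# Size the filter for 1 million distinct values at a 1% target rate.
bits, hashes = bloom_parameters(1_000_000, 0.01)

# Rate when the estimate was right, and when the window really holds 10x more.
fpp_as_configured = expected_fpp(bits, hashes, 1_000_000)
fpp_underestimated = expected_fpp(bits, hashes, 10_000_000)
```

With these assumed numbers, a filter sized for 1 million values stays near the 
1% target, but if the window actually holds 10 million distinct values nearly 
every lookup reports a possible match and the filter stops pruning anything.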


> Add bloom filters to parquet statistics
> ---------------------------------------
>
>                 Key: PARQUET-41
>                 URL: https://issues.apache.org/jira/browse/PARQUET-41
>             Project: Parquet
>          Issue Type: New Feature
>          Components: parquet-format, parquet-mr
>            Reporter: Alex Levenson
>            Assignee: Ferdinand Xu
>              Labels: filter2
>
> For row groups with no dictionary, we could still produce a bloom filter. 
> This could be very useful in filtering entire row groups.
> Pull request:
> https://github.com/apache/parquet-mr/pull/215



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
