[ 
https://issues.apache.org/jira/browse/PARQUET-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338670#comment-16338670
 ] 

Junjie Chen commented on PARQUET-41:
------------------------------------

Hi [~jbapple], AFAIK, we don't have any benchmark results yet that compare 
dictionary filtering against bloom filters. I'd like to ask again: is such a 
benchmark meaningful? The dictionary filter targets columns with small 
cardinality, while the bloom filter targets columns with large cardinality. A 
column with large cardinality cannot even be dictionary-encoded because of the 
benefit calculation logic, and a bloom filter on a column with small 
cardinality obviously shows less benefit than the dictionary filter. 
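
To make the cardinality argument concrete, here is a toy sketch in Python. It
is not parquet-mr code. ToyBloomFilter, summarize, can_skip, and the
max_dict_entries threshold are all hypothetical names; the threshold is only a
stand-in for the writer's benefit calculation, whose actual rule is not
reproduced here.

```python
import hashlib


class ToyBloomFilter:
    """Fixed-size bit array with several hash probes (illustrative only,
    not the Parquet bloom filter format)."""

    def __init__(self, num_bits=65536, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, value):
        # Derive num_hashes bit positions from one SHA-256 digest.
        digest = hashlib.sha256(repr(value).encode()).digest()
        for i in range(self.num_hashes):
            chunk = digest[4 * i: 4 * i + 4]
            yield int.from_bytes(chunk, "little") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(value))


def summarize(column, max_dict_entries=1000):
    """Build per-row-group metadata. The dictionary is dropped when the
    column has too many distinct values (hypothetical stand-in for the
    real benefit calculation)."""
    distinct = set(column)
    dictionary = distinct if len(distinct) <= max_dict_entries else None
    bloom = ToyBloomFilter()
    for value in distinct:
        bloom.add(value)
    return dictionary, bloom


def can_skip(dictionary, bloom, value):
    """Return True if the row group certainly does not contain value."""
    if dictionary is not None:
        return value not in dictionary      # exact answer
    return not bloom.might_contain(value)   # probabilistic answer
```

With a column such as ["a", "b", "c"] * 100 the dictionary survives and gives
an exact answer, so the bloom filter adds nothing. With 5000 distinct values
the dictionary is dropped in this sketch and only the bloom filter, which stays
at 8 KB regardless of the number of distinct values, can still rule out a row
group.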

> Add bloom filters to parquet statistics
> ---------------------------------------
>
>                 Key: PARQUET-41
>                 URL: https://issues.apache.org/jira/browse/PARQUET-41
>             Project: Parquet
>          Issue Type: New Feature
>          Components: parquet-format, parquet-mr
>            Reporter: Alex Levenson
>            Assignee: Ferdinand Xu
>            Priority: Major
>              Labels: filter2
>
> For row groups with no dictionary, we could still produce a bloom filter. 
> This could be very useful in filtering entire row groups.
> Pull request:
> https://github.com/apache/parquet-mr/pull/215



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
