[ https://issues.apache.org/jira/browse/PARQUET-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338670#comment-16338670 ]
Junjie Chen commented on PARQUET-41:
------------------------------------

Hi [~jbapple],

AFAIK, we don't have benchmark results comparing the dictionary filter with the bloom filter yet. I'd like to ask again: is such a benchmark meaningful? The dictionary filter is for columns with small cardinality, while the bloom filter is for columns with large cardinality. A column with large cardinality cannot even be dictionary-encoded because of the benefit calculation logic, and a bloom filter on a column with small cardinality obviously shows less benefit than the dictionary filter.

> Add bloom filters to parquet statistics
> ---------------------------------------
>
>                 Key: PARQUET-41
>                 URL: https://issues.apache.org/jira/browse/PARQUET-41
>             Project: Parquet
>          Issue Type: New Feature
>          Components: parquet-format, parquet-mr
>            Reporter: Alex Levenson
>            Assignee: Ferdinand Xu
>            Priority: Major
>              Labels: filter2
>
> For row groups with no dictionary, we could still produce a bloom filter.
> This could be very useful in filtering entire row groups.
> Pull request:
> https://github.com/apache/parquet-mr/pull/215
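
To make the cardinality argument above concrete, here is a minimal sketch of how a bloom filter could let a reader skip a row group whose column has too many distinct values for a dictionary to be worthwhile. It is not parquet-mr code: the `BloomFilter` class, the `can_skip_row_group` helper, the filter size, the hash scheme, and the `user-N` values are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Toy bloom filter; bit count and hash count are illustrative assumptions."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, value: str):
        # Derive the bit positions from one SHA-256 digest (double hashing).
        digest = hashlib.sha256(value.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, value: str) -> None:
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(value)
        )


def can_skip_row_group(bloom: BloomFilter, wanted: str) -> bool:
    """Hypothetical helper: a row group can be skipped only on a definite miss."""
    return not bloom.might_contain(wanted)


# Hypothetical high-cardinality column: 5,000 distinct values in one row group.
row_group_values = [f"user-{i}" for i in range(5000)]

bloom = BloomFilter()
for v in row_group_values:
    bloom.add(v)

# A dictionary would need one entry per distinct value, while the filter
# stays at a fixed size no matter how many values are added.
dictionary = set(row_group_values)
print(len(dictionary), "dictionary entries vs", len(bloom.bits), "filter bytes")
print("skip for present value:", can_skip_row_group(bloom, "user-42"))
```

A value that was written is never reported as absent, so skipping on a miss is safe. A value that was not written may occasionally be reported as possibly present, in which case the row group is simply read as usual.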