[ https://issues.apache.org/jira/browse/PARQUET-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16513883#comment-16513883 ]
Jim Apple edited comment on PARQUET-41 at 6/15/18 2:28 PM:
-----------------------------------------------------------
I took a look at that benchmark and I now believe that, when the number of
distinct values is the same as the number of values, or close to it, this can
provide some performance benefit.
Junjie also shared with me the following resources:
More benchmark results:
https://docs.google.com/spreadsheets/d/1LQqGZ1EQSkPBXtdi9nyANiQOhwNFwqiiFe8Sazclf5Y/edit#gid=0
Fork with Bloom filters enabled:
https://github.com/cjjnjust/parquet-mr/tree/parquet-41-base-1.8.x
Data generator: https://github.com/cjjnjust/SQLDataGen
was (Author: jbapple):
I took a look at that benchmark and I now believe that, when the number of
distinct values is the same as the number of values, or close to it, this can
provide some performance benefit.
> Add bloom filters to parquet statistics
> ---------------------------------------
>
> Key: PARQUET-41
> URL: https://issues.apache.org/jira/browse/PARQUET-41
> Project: Parquet
> Issue Type: New Feature
> Components: parquet-format, parquet-mr
> Reporter: Alex Levenson
> Assignee: Junjie Chen
> Priority: Major
> Labels: filter2
>
> For row groups with no dictionary, we could still produce a bloom filter.
> This could be very useful in filtering entire row groups.
> Pull request:
> https://github.com/apache/parquet-mr/pull/215
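
To make the row-group filtering idea in the description concrete, here is a
minimal sketch in Python. It is a toy, not the filter layout or hash function
proposed in PARQUET-41 or implemented in parquet-mr: the class ToyBloomFilter,
the helper row_groups_to_read, the bit and hash counts, and the sample data are
all assumptions made up for illustration. Each row group gets one filter; a
reader probes the filters and skips every row group whose filter reports the
value as definitely absent.

```python
import hashlib


class ToyBloomFilter:
    """Illustrative Bloom filter; NOT the layout proposed in PARQUET-41."""

    def __init__(self, num_bits=1024, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, value):
        # Derive num_hashes bit positions by double hashing one SHA-256 digest.
        digest = hashlib.sha256(repr(value).encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:16], "little") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value):
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(value)
        )


def row_groups_to_read(filters, wanted):
    """Return indexes of row groups that cannot be ruled out for `wanted`."""
    return [i for i, bf in enumerate(filters) if bf.might_contain(wanted)]


# Hypothetical data: 300 distinct values interleaved across three row groups,
# so every row group covers almost the same min/max range.
row_groups = [[v for v in range(300) if v % 3 == i] for i in range(3)]

# Build one filter per row group at write time.
filters = []
for values in row_groups:
    bf = ToyBloomFilter()
    for v in values:
        bf.add(v)
    filters.append(bf)

# At read time, probe the filters before touching any row group data.
candidates = row_groups_to_read(filters, 151)
print("row groups to read for value 151:", candidates)
```

In this made-up data the value 151 lies inside the min/max range of all three
row groups, so min/max statistics alone could not exclude any of them. The
filter for the row group that actually holds 151 always answers "possibly
present"; the other two are skipped unless they happen to produce a false
positive, which becomes more likely as the filter fills up.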