[jira] [Comment Edited] (PARQUET-41) Add bloom filters to parquet statistics

Jim Apple (JIRA) Fri, 15 Jun 2018 07:29:32 -0700


    [ 
https://issues.apache.org/jira/browse/PARQUET-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16513883#comment-16513883
 ]


Jim Apple edited comment on PARQUET-41 at 6/15/18 2:28 PM:
-----------------------------------------------------------

I took a look at that benchmark and I now believe that when the number of 
distinct values is the same as the number of values, or close to it, a bloom 
filter can provide some performance benefit.

Junjie also shared with me the following resources:

More benchmark results: 
https://docs.google.com/spreadsheets/d/1LQqGZ1EQSkPBXtdi9nyANiQOhwNFwqiiFe8Sazclf5Y/edit#gid=0

Fork with BF enabled: 
https://github.com/cjjnjust/parquet-mr/tree/parquet-41-base-1.8.x

Data generator: https://github.com/cjjnjust/SQLDataGen



was (Author: jbapple):
I took a look at that benchmark and I now believe that when the number of 
distinct values is the same as the number of values, or close to it, a bloom 
filter can provide some performance benefit.

> Add bloom filters to parquet statistics
> ---------------------------------------
>
>                 Key: PARQUET-41
>                 URL: https://issues.apache.org/jira/browse/PARQUET-41
>             Project: Parquet
>          Issue Type: New Feature
>          Components: parquet-format, parquet-mr
>            Reporter: Alex Levenson
>            Assignee: Junjie Chen
>            Priority: Major
>              Labels: filter2
>
> For row groups with no dictionary, we could still produce a bloom filter. 
> This could be very useful in filtering entire row groups.
> Pull request:
> https://github.com/apache/parquet-mr/pull/215
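
The row-group filtering idea in the description above can be illustrated 
with a minimal sketch. It is not the parquet-mr implementation and not the 
format proposed in the pull request: the BloomFilter class, its SHA-256 based 
hashing, the bit-array size, and the row_groups_to_scan helper are all 
assumptions made for illustration. The sketch builds one filter per row group 
and skips any row group whose filter rules out the value being searched for.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: may give false positives, never false negatives."""

    def __init__(self, num_bits=4096, num_hashes=3):
        # Sizes are arbitrary illustration values, not Parquet defaults.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, value):
        # Derive num_hashes bit positions by salting a SHA-256 digest.
        data = repr(value).encode("utf-8")
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(bytes([seed]) + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(value)
        )


def row_groups_to_scan(filters, wanted):
    """Return indexes of row groups that cannot be ruled out for `wanted`."""
    return [i for i, f in enumerate(filters) if f.might_contain(wanted)]


# Hypothetical high-cardinality column: every value is distinct, so a
# dictionary would not help, but a per-row-group filter still can.
row_groups = [
    ["user-%d" % i for i in range(0, 100)],
    ["user-%d" % i for i in range(100, 200)],
    ["user-%d" % i for i in range(200, 300)],
]

filters = []
for values in row_groups:
    bf = BloomFilter()
    for v in values:
        bf.add(v)
    filters.append(bf)

# Row group 1 is always a candidate; the others are skipped unless the
# filter reports a false positive.
candidates = row_groups_to_scan(filters, "user-150")
```

Because a Bloom filter never reports a false negative, the row group that 
actually holds the value is always in the candidate list. Any other row 
group that appears there is a false positive and only costs an unnecessary 
read.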



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
