[
https://issues.apache.org/jira/browse/PARQUET-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16018218#comment-16018218
]
Junjie Chen edited comment on PARQUET-41 at 5/20/17 12:58 AM:
--------------------------------------------------------------
Hi [~rdblue]
In the telecom example, the query column is not unique if someone makes two
calls within a short time, such as 1 minute. The number of rows output depends
on how frequently the subscriber is recorded. In my test, the false positive
probability was set to 0.05.
As for the difference in speedup, I think it comes down to how many columns the
table has. In your calculation, the bloom filter is 10% of the data, but that
proportion would be significantly smaller in a table with 90+ columns, as in
the telecom example (the time spent reading columns may not drop significantly
if Parquet vectorization is enabled, but the share of a row group taken up by
the bloom filter shrinks significantly).
Tables with many columns are very common among customers. Based on your
calculation, we can spend a very small amount of space on index statistics to
achieve at least a 5x speedup in the HDFS scan stage for big tables with many
columns and a ~unique column.
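
The proportion argument above can be checked with a rough back-of-the-envelope
sketch. It is only an illustration under stated assumptions, not the
calculation referenced in this thread: it uses the textbook sizing formula for
a classic Bloom filter, treats every value in the filtered column as distinct,
and assumes all columns hold uncompressed 8-byte values. The function names
and the column widths are hypothetical.

```python
import math


def bloom_bits_per_value(fpp: float) -> float:
    """Optimal bits per distinct value for a classic Bloom filter
    with false positive probability `fpp` (textbook formula)."""
    return -math.log(fpp) / (math.log(2) ** 2)


def bloom_fraction_of_row_data(fpp: float, bytes_per_value: int, num_columns: int) -> float:
    """Size of a filter on one column as a fraction of the data in all columns.

    Assumptions (not from the thread): every value in the filtered column is
    distinct, every column has the same uncompressed width, and encoding and
    compression are ignored.
    """
    filter_bytes_per_row = bloom_bits_per_value(fpp) / 8
    row_bytes = bytes_per_value * num_columns
    return filter_bytes_per_row / row_bytes


if __name__ == "__main__":
    # fpp = 0.05 is the setting mentioned in the comment; 8 bytes is assumed.
    for cols in (1, 10, 90):
        frac = bloom_fraction_of_row_data(0.05, 8, cols)
        print(f"{cols:>3} columns: filter is {frac:.3%} of the data")
```

Under these assumptions the filter costs a fixed number of bits per row, so
its share of the data falls in direct proportion to the number of columns.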
> Add bloom filters to parquet statistics
> ---------------------------------------
>
> Key: PARQUET-41
> URL: https://issues.apache.org/jira/browse/PARQUET-41
> Project: Parquet
> Issue Type: New Feature
> Components: parquet-format, parquet-mr
> Reporter: Alex Levenson
> Assignee: Ferdinand Xu
> Labels: filter2
>
> For row groups with no dictionary, we could still produce a bloom filter.
> This could be very useful in filtering entire row groups.
> Pull request:
> https://github.com/apache/parquet-mr/pull/215