[
https://issues.apache.org/jira/browse/PARQUET-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16018218#comment-16018218
]
Junjie Chen edited comment on PARQUET-41 at 5/22/17 6:26 AM:
-------------------------------------------------------------
Hi [~rdblue]
In the telecom example, the query column is not unique: a subscriber can have
two calls within a short time, such as 1 minute. The number of rows returned by
the telecom example query depends on how frequently the subscriber was
recorded, and the number of columns in the query is 90+. In the telecom example
query tests, the false positive parameter was set to 0.05.
About the difference in speedup, I think it comes down to how many columns the
table has. In your calculation, the bloom filter is 10% of the data, but that
proportion would be significantly reduced in a 90+ column table such as the one
in the telecom example (the time spent reading columns may not be significantly
reduced because of vectorization/projection, but the size of the bloom filter
relative to a row group shrinks significantly).
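To make the proportion argument concrete, here is a rough sketch of the
arithmetic. It is not the calculation from the earlier comment and not how
parquet-mr sizes its filters. It assumes a classic Bloom filter with the
optimal number of hash functions, all values in the filtered column distinct,
every column storing a fixed 8 bytes per row, and no compression or encoding.
The function names and the 8-byte width are illustrative assumptions only.

```python
import math


def bloom_bits_per_value(fpp: float) -> float:
    """Bits needed per distinct value in a classic Bloom filter.

    Uses the textbook sizing formula m/n = -ln(p) / (ln 2)^2, which assumes
    the optimal number of hash functions for the chosen false positive
    probability p.
    """
    return -math.log(fpp) / (math.log(2) ** 2)


def filter_fraction(fpp: float, bytes_per_value: int, num_columns: int) -> float:
    """Size of a one-column Bloom filter relative to the whole row group.

    Assumption (hypothetical, for illustration): every column stores
    bytes_per_value bytes per row, values in the filtered column are all
    distinct, and nothing is compressed.
    """
    filter_bytes_per_row = bloom_bits_per_value(fpp) / 8
    row_bytes = bytes_per_value * num_columns
    return filter_bytes_per_row / row_bytes


if __name__ == "__main__":
    fpp = 0.05  # false positive parameter used in the telecom tests
    print(f"bits per value: {bloom_bits_per_value(fpp):.2f}")
    for cols in (1, 90):
        print(f"{cols:>2} column(s): {filter_fraction(fpp, 8, cols):.3%}")
```

Under these assumptions the filter costs a little over 6 bits per value, which
is roughly 10% of a single 8-byte column but only about 0.1% of a row group
with 90 such columns. Real tables with mixed types and compression will give
different numbers.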
Tables with many columns are very common among many customers. Based on your
calculation, we can spend a very small amount of space on index statistics to
achieve at least a 5x speedup in the HDFS scan stage for big tables with many
columns, provided the data in the row group is unique.
was (Author: junjie):
Hi [~rdblue]
In the telecom example, the query column is not unique: a subscriber can have
two calls within a short time, such as 1 minute. The number of rows output
depends on how frequently the subscriber was recorded. In my test, the false
positive parameter was set to 0.05.
About the difference in speedup, I think it comes down to how many columns the
table has. In your calculation, the bloom filter is 10% of the data, but that
proportion would be significantly reduced in a 90+ column table such as the one
in the telecom example (the time spent reading columns may not be significantly
reduced because of vectorization/projection, but the size of the bloom filter
relative to a row group shrinks significantly).
Tables with many columns are very common among many customers. Based on your
calculation, we can spend a very small amount of space on index statistics to
achieve at least a 5x speedup in the HDFS scan stage for big tables with many
columns, provided the data in the row group is unique.
> Add bloom filters to parquet statistics
> ---------------------------------------
>
> Key: PARQUET-41
> URL: https://issues.apache.org/jira/browse/PARQUET-41
> Project: Parquet
> Issue Type: New Feature
> Components: parquet-format, parquet-mr
> Reporter: Alex Levenson
> Assignee: Ferdinand Xu
> Labels: filter2
>
> For row groups with no dictionary, we could still produce a bloom filter.
> This could be very useful in filtering entire row groups.
> Pull request:
> https://github.com/apache/parquet-mr/pull/215