[ 
https://issues.apache.org/jira/browse/PARQUET-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16018218#comment-16018218
 ] 

Junjie Chen edited comment on PARQUET-41 at 5/22/17 6:26 AM:
-------------------------------------------------------------

Hi [~rdblue]
In the telecom example, the query column is not unique if someone has two
calls within a short time, such as 1 minute. The number of rows returned by
the telecom example query depends on how frequently the subscriber was
recorded, and the number of columns in the telecom example query is 90+. In
the telecom example query tests, the false positive probability was set to
0.05.
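
To give a sense of what a 0.05 false positive setting means for filter size,
here is a rough sketch based on the textbook bloom filter sizing formulas. It
is not the sizing code from the pull request, which may size its filter
differently; the function name and the figure of 1,000,000 distinct values are
illustrative assumptions only.

```python
import math


def bloom_filter_size(num_distinct, fpp):
    """Textbook sizing for a standard bloom filter.

    Assumption: this is the classic formula, not necessarily what the
    parquet-mr pull request implements.
    """
    # Optimal number of bits: m = -n * ln(p) / (ln 2)^2
    bits = math.ceil(-num_distinct * math.log(fpp) / (math.log(2) ** 2))
    # Optimal number of hash functions: k = (m / n) * ln 2
    hashes = max(1, round(bits / num_distinct * math.log(2)))
    return bits, hashes


# Hypothetical row group with 1,000,000 distinct values in the filtered column.
bits, hashes = bloom_filter_size(1_000_000, 0.05)
print(f"{bits / 1_000_000:.2f} bits per value, {hashes} hash functions")
```

Under these assumptions a 0.05 setting costs a little over 6 bits per distinct
value and about 4 hash functions, independent of how wide the rows are.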

As for the difference in speedup, I think it comes down to how many columns
the table has. In your calculation, the bloom filter is 10% of the data, but
that proportion would be significantly smaller in a table with 90+ columns,
as in the telecom example (the time spent reading columns may not drop much
because of vectorization/projection, but the size of the bloom filter in a
row group, relative to the data, can shrink significantly).
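
The proportion argument can be made concrete with a small back-of-the-envelope
sketch. It assumes, purely for illustration, that the filter is built on a
single column, that it is 10% of that column's data as in the calculation
referred to above, and that all columns are about the same size. None of these
are measured numbers from the telecom tests.

```python
def filter_share_of_row_group(filter_to_column_ratio, num_columns):
    """Bloom filter size as a fraction of the whole row group.

    Illustrative assumptions: one filtered column, equally sized columns.
    """
    if num_columns < 1:
        raise ValueError("num_columns must be at least 1")
    return filter_to_column_ratio / num_columns


for cols in (1, 10, 90):
    share = filter_share_of_row_group(0.10, cols)
    print(f"{cols:>2} columns: filter is {share:.2%} of the row group")
```

With 90 equally sized columns, the filter goes from 10% of the data to roughly
0.1% of the row group, which is the shrinkage described above.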

Tables with many columns are very common among customers. Based on your
calculation, we can spend a very small amount of space on index statistics to
achieve at least a 5x speedup in the HDFS scan stage for big tables with many
columns, if the data in the row group is unique.





was (Author: junjie):
Hi [~rdblue]
In the telecom example, the query column is not unique if someone has two
calls within a short time, such as 1 minute. The number of rows output depends
on how frequently the subscriber was recorded. In my test, the false positive
probability was set to 0.05.

As for the difference in speedup, I think it comes down to how many columns
the table has. In your calculation, the bloom filter is 10% of the data, but
that proportion would be significantly smaller in a table with 90+ columns,
as in the telecom example (the time spent reading columns may not drop much
because of vectorization/projection, but the size of the bloom filter in a
row group, relative to the data, can shrink significantly).

Tables with many columns are very common among customers. Based on your
calculation, we can spend a very small amount of space on index statistics to
achieve at least a 5x speedup in the HDFS scan stage for big tables with many
columns, if the data in the row group is unique.




> Add bloom filters to parquet statistics
> ---------------------------------------
>
>                 Key: PARQUET-41
>                 URL: https://issues.apache.org/jira/browse/PARQUET-41
>             Project: Parquet
>          Issue Type: New Feature
>          Components: parquet-format, parquet-mr
>            Reporter: Alex Levenson
>            Assignee: Ferdinand Xu
>              Labels: filter2
>
> For row groups with no dictionary, we could still produce a bloom filter. 
> This could be very useful in filtering entire row groups.
> Pull request:
> https://github.com/apache/parquet-mr/pull/215



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
