[ 
https://issues.apache.org/jira/browse/PARQUET-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15148130#comment-15148130
 ] 

Ferdinand Xu commented on PARQUET-41:
-------------------------------------

Hi [~rdblue],
I have a basic idea for estimating the expected number of entries that a 
bloom filter needs. 
AFAIK we can’t get the row count for a row group until all of its data has been 
flushed to disk. Given that, we can estimate the size in the following way.
For the first row group, we don’t create bloom filter statistics up front. 
Once the first row group has been flushed, we have a general idea of the row 
count. For the remaining row groups, we use that row count to size the bloom 
filter bit set. 
We can make a small improvement to the strategy above. We know the size of the 
whole row group, so we can calculate the expected number of entries from the 
average size of the first 100 or 1000 rows. Because a bloom filter’s bit set 
has to be sized before any value is inserted, we need to store these rows in a 
temporary buffer. Once the bit set is created, we insert the buffered values 
into it and then drop them.
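
The buffering strategy above could look roughly like the following. This is only a sketch under stated assumptions: DeferredBloomFilter, its constructor parameters (rowGroupSizeBytes, sampleRows, bitsPerEntry, numHashes) and the hash functions are hypothetical names chosen for illustration, not part of parquet-mr.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

// Hypothetical sketch only: none of these names come from parquet-mr.
final class DeferredBloomFilter {
    private final long rowGroupSizeBytes; // assumed: target size of a row group
    private final int sampleRows;         // e.g. 100 or 1000
    private final int bitsPerEntry;       // assumed sizing knob
    private final int numHashes;          // assumed number of probes per value
    private final List<byte[]> pending = new ArrayList<>();
    private long sampledBytes = 0;
    private BitSet bits;                  // stays null until the size is known
    private int numBits;

    DeferredBloomFilter(long rowGroupSizeBytes, int sampleRows,
                        int bitsPerEntry, int numHashes) {
        this.rowGroupSizeBytes = rowGroupSizeBytes;
        this.sampleRows = sampleRows;
        this.bitsPerEntry = bitsPerEntry;
        this.numHashes = numHashes;
    }

    // Expected entries = row group size / average size of the sampled rows.
    static long estimateEntries(long rowGroupSizeBytes, long sampledBytes,
                                int sampledRows) {
        double avgRowBytes = Math.max(1.0, (double) sampledBytes / sampledRows);
        return (long) Math.ceil(rowGroupSizeBytes / avgRowBytes);
    }

    void add(byte[] value) {
        if (bits != null) {
            insert(value);
            return;
        }
        // Size not known yet: keep the value in the temporary buffer.
        pending.add(value);
        sampledBytes += value.length;
        if (pending.size() >= sampleRows) {
            allocate(estimateEntries(rowGroupSizeBytes, sampledBytes, pending.size()));
        }
    }

    // Row group ended before the sample filled up: the buffered count is exact.
    void finish() {
        if (bits == null) {
            allocate(Math.max(1, pending.size()));
        }
    }

    boolean mightContain(byte[] value) {
        if (bits == null) {
            throw new IllegalStateException("call finish() first");
        }
        int h1 = Arrays.hashCode(value);
        int h2 = fnv1a(value) | 1;
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(Math.floorMod(h1 + i * h2, numBits))) {
                return false;
            }
        }
        return true;
    }

    // Create the bit set, replay the buffered values into it, then drop them.
    private void allocate(long expectedEntries) {
        long wanted = Math.max(64L, expectedEntries * bitsPerEntry);
        numBits = (int) Math.min(Integer.MAX_VALUE, wanted);
        bits = new BitSet(numBits);
        for (byte[] v : pending) {
            insert(v);
        }
        pending.clear();
    }

    private void insert(byte[] value) {
        int h1 = Arrays.hashCode(value);
        int h2 = fnv1a(value) | 1;
        for (int i = 0; i < numHashes; i++) {
            bits.set(Math.floorMod(h1 + i * h2, numBits));
        }
    }

    // Simple 32-bit FNV-1a, used here only as a second hash for illustration.
    private static int fnv1a(byte[] value) {
        int h = 0x811c9dc5;
        for (byte b : value) {
            h ^= (b & 0xff);
            h *= 16777619;
        }
        return h;
    }
}
```

The buffer only ever holds sampleRows values, so the extra memory is bounded no matter how large the row group becomes.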
One thing I want to highlight is that we don’t need to know the *exact* row 
count; an estimated value is enough. 
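
To illustrate why an estimate should be enough, here is a small sketch that uses the standard approximate sizing formulas for a bloom filter (m bits, k hash functions, n entries, target false-positive rate p). BloomSizing and its method names are hypothetical and not taken from parquet-mr.

```java
// Hypothetical helper; the formulas are the usual bloom filter approximations.
final class BloomSizing {
    // Bits needed for n entries at false-positive rate p: m = -n ln(p) / (ln 2)^2
    static long optimalBits(long n, double p) {
        return (long) Math.ceil(-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    }

    // Number of hash functions for m bits and n entries: k = (m / n) ln 2
    static int optimalHashes(long m, long n) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    // Approximate false-positive rate once n entries are in: (1 - e^(-kn/m))^k
    static double falsePositiveRate(long m, int k, long n) {
        return Math.pow(1.0 - Math.exp(-(double) k * n / m), k);
    }
}
```

Sizing the filter with optimalBits for the estimated count and then evaluating falsePositiveRate with a somewhat larger actual count shows the rate rising gradually rather than the filter breaking, which is the reason an approximate row count can be acceptable.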
Any thoughts about the idea?


> Add bloom filters to parquet statistics
> ---------------------------------------
>
>                 Key: PARQUET-41
>                 URL: https://issues.apache.org/jira/browse/PARQUET-41
>             Project: Parquet
>          Issue Type: New Feature
>          Components: parquet-format, parquet-mr
>            Reporter: Alex Levenson
>            Assignee: Ferdinand Xu
>              Labels: filter2
>
> For row groups with no dictionary, we could still produce a bloom filter. 
> This could be very useful in filtering entire row groups.
> Pull request:
> https://github.com/apache/parquet-mr/pull/215



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
