[
https://issues.apache.org/jira/browse/PARQUET-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15009981#comment-15009981
]
Ferdinand Xu commented on PARQUET-41:
-------------------------------------
Thank you for your feedback. I have the patch working on CDH 5.5. With the
min/max statistics disabled, it brings a ~2.6X performance improvement at the
cost of 3% extra space when executing a query on a 1.5G data set.
Since the code base has diverged significantly between CDH 5.5 and master,
some further work is needed to make it work upstream. I think we do not need
to calculate the number of unique values, since the same value always produces
the same hash values under the same hash functions. If the expected
number is higher than the real number, that should be OK. So the problem
becomes how to get the total number of pages. I will think about it and work
on the design document. Thank you!
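
The point about unique values can be illustrated with a minimal sketch. This is not the patch and not Parquet's API: the class name `BloomFilter`, the helper `optimal_size`, the SHA-256 based double hashing, and the 1% false-positive target are all assumptions made for illustration. It shows that inserting a value repeatedly sets the same bits as inserting it once, and that sizing the filter by an over-estimate (here the row count) only costs extra space.

```python
import hashlib
import math


def optimal_size(expected_n, fpp):
    """Textbook Bloom filter sizing: bit count m and hash count k."""
    m = math.ceil(-expected_n * math.log(fpp) / (math.log(2) ** 2))
    k = max(1, round(m / expected_n * math.log(2)))
    return m, k


class BloomFilter:
    """Hypothetical in-memory Bloom filter, for illustration only."""

    def __init__(self, expected_n, fpp=0.01):
        self.m, self.k = optimal_size(expected_n, fpp)
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, value):
        # Derive k bit positions from one digest (double hashing).
        digest = hashlib.sha256(repr(value).encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(value)
        )


# A column of 1000 rows holding only 100 distinct values.
column = [i % 100 for i in range(1000)]

# Size both filters by the row count, an over-estimate of the distinct count.
with_dups = BloomFilter(expected_n=len(column))
for v in column:
    with_dups.add(v)

distinct_only = BloomFilter(expected_n=len(column))
for v in set(column):
    distinct_only.add(v)

# Duplicates hash to the same positions, so both filters end up identical.
same_bits = with_dups.bits == distinct_only.bits
```

Over-estimating the expected count makes the filter larger than strictly necessary, which lowers the false-positive rate rather than raising it; the only penalty is space.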
> Add bloom filters to parquet statistics
> ---------------------------------------
>
> Key: PARQUET-41
> URL: https://issues.apache.org/jira/browse/PARQUET-41
> Project: Parquet
> Issue Type: New Feature
> Components: parquet-format, parquet-mr
> Reporter: Alex Levenson
> Assignee: Ferdinand Xu
> Labels: filter2
>
> For row groups with no dictionary, we could still produce a bloom filter.
> This could be very useful in filtering entire row groups.
> Pull request:
> https://github.com/apache/parquet-mr/pull/215