[ https://issues.apache.org/jira/browse/PARQUET-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605183#comment-14605183 ]
Ferdinand Xu commented on PARQUET-41:
-------------------------------------
Hi [~rdblue], I really appreciate your detailed comments and the concrete data.
To make sure I follow your points, I’d like to start with a short summary.
The current design has two drawbacks. The first is space efficiency: according
to the calculations in the sheet, the bloom filter bit set will occupy much
more space than expected. The second is how to obtain the setting for the
expected number of entries.
For the first one, I am thinking about adding a header (a kind of Statistics
header), as was done for the dictionary. We could create a map-like data
structure with the data page as key and the bloom filter as value. This is just
a rough idea and needs more investigation. Regarding the expected number of
entries, I like your second idea too: we could obtain it at runtime and write
to the bloom filter when the flush happens. Any thoughts?
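To make the space concern concrete, here is a rough sketch of the textbook
bloom filter sizing formulas, m = -n * ln(p) / (ln 2)^2 bits and
k = (m / n) * ln 2 hash functions. The function names are hypothetical and are
not part of parquet-mr, and the 1% false positive probability is only an
assumed value for illustration, not a number taken from the sheet.

```python
import math


def optimal_num_bits(expected_entries: int, fpp: float) -> int:
    """Bits needed for `expected_entries` distinct values at false positive rate `fpp`.

    Textbook formula: m = -n * ln(p) / (ln 2)^2. Hypothetical helper, not a Parquet API.
    """
    if expected_entries <= 0:
        raise ValueError("expected_entries must be positive")
    if not 0.0 < fpp < 1.0:
        raise ValueError("fpp must be strictly between 0 and 1")
    return math.ceil(-expected_entries * math.log(fpp) / (math.log(2) ** 2))


def optimal_num_hashes(expected_entries: int, num_bits: int) -> int:
    """Number of hash functions that minimizes the false positive rate: k = (m / n) * ln 2."""
    return max(1, round(num_bits / expected_entries * math.log(2)))


if __name__ == "__main__":
    assumed_fpp = 0.01  # assumption for illustration only
    for n in (1_000, 100_000, 1_000_000):
        bits = optimal_num_bits(n, assumed_fpp)
        hashes = optimal_num_hashes(n, bits)
        print(f"entries={n:>9}  bytes={bits // 8:>9}  hashes={hashes}")
```

The size grows linearly with the expected number of entries, so a setting that
overestimates the entries wastes space in proportion, which is why measuring
the count at runtime before the flush looks attractive.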
Thank you!
Ferd
> Add bloom filters to parquet statistics
> ---------------------------------------
>
> Key: PARQUET-41
> URL: https://issues.apache.org/jira/browse/PARQUET-41
> Project: Parquet
> Issue Type: New Feature
> Components: parquet-format, parquet-mr
> Reporter: Alex Levenson
> Assignee: Ferdinand Xu
> Labels: filter2
>
> For row groups with no dictionary, we could still produce a bloom filter.
> This could be very useful in filtering entire row groups.