[
https://issues.apache.org/jira/browse/PARQUET-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17697987#comment-17697987
]
Mars commented on PARQUET-2254:
-------------------------------
[~wgtmac] [~gszadovszky]
1) This Jira tracks building a BloomFilter with a more accurate size.
The description outlines my general idea; I will complete a first version
based on it.
2) PARQUET-2237 is used to optimize RowGroupFilter.
As [~gszadovszky] said, I am also unsure whether it is better to check the
dictionary first or the bloom filter first.
But I think one thing is certain: if there is a dictionary, skipping the
BloomFilter check will definitely be better than the previous implementation.
As for the next step, the order in which the bloom filter and the dictionary
are checked needs more consideration.
3) In PARQUET-2237
I have a new idea now: if there are multiple filter predicates, for example
connected by `OR`, we can optimize the filter predicates one by one.
For example, for A > 3 OR B = 1:
if the statistics filter determines that A > 3 is impossible (drop) and that
B = 1 might match, the result of the statistics filter is mightMatch.
Then in the dictionary filter we only need to compare B and can skip A, which
improves performance.
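
A minimal Java sketch of the idea in 3), under stated assumptions: the names `OrPredicatePruning`, `FilterLevel` and `canDropRowGroup` are hypothetical and not part of parquet-mr, and leaf predicates are represented as plain strings only for illustration. It shows how a leaf already ruled out by one level is not evaluated again by the next.

```java
import java.util.ArrayList;
import java.util.List;

class OrPredicatePruning {
    /** Hypothetical filter level (statistics, dictionary, bloom filter). */
    interface FilterLevel {
        /** Returns true when this level proves the leaf predicate cannot match the row group. */
        boolean canDrop(String leafPredicate);
    }

    /**
     * Evaluates the leaves of an OR predicate level by level. A leaf ruled out by an
     * earlier level is not evaluated again; the row group is dropped once no leaf remains.
     */
    static boolean canDropRowGroup(List<String> orLeaves, List<FilterLevel> levels) {
        List<String> undecided = new ArrayList<>(orLeaves);
        for (FilterLevel level : levels) {
            undecided.removeIf(level::canDrop);
            if (undecided.isEmpty()) {
                return true; // every branch of the OR is impossible
            }
        }
        return false; // at least one branch might still match
    }

    public static void main(String[] args) {
        List<String> evaluatedByDictionary = new ArrayList<>();
        FilterLevel statistics = leaf -> leaf.equals("A > 3"); // statistics rule out A > 3
        FilterLevel dictionary = leaf -> {
            evaluatedByDictionary.add(leaf); // record which leaves this level had to check
            return leaf.equals("B = 1");     // dictionary rules out B = 1
        };
        boolean drop = canDropRowGroup(List.of("A > 3", "B = 1"), List.of(statistics, dictionary));
        System.out.println(drop + " " + evaluatedByDictionary); // prints "true [B = 1]"
    }
}
```

This only covers `OR`; for `AND`, a single impossible leaf is already enough to drop the row group, so the bookkeeping would differ.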
> Build a BloomFilter with a more precise size
> --------------------------------------------
>
> Key: PARQUET-2254
> URL: https://issues.apache.org/jira/browse/PARQUET-2254
> Project: Parquet
> Issue Type: Improvement
> Reporter: Mars
> Assignee: Mars
> Priority: Major
>
> Currently the user specifies the size and then builds the BloomFilter. In
> general scenarios, the number of distinct values is not known in advance.
> If the BloomFilter can be sized automatically according to the data, the file
> size can be reduced and the reading efficiency can also be improved.
> My idea is that the user specifies a maximum BloomFilter size, and we then
> build multiple BloomFilters at the same time. The largest BloomFilter can
> serve as a counting tool (if there is no hit when inserting a value, the
> counter is incremented; this may be imprecise, but it is good enough).
> Then, at the end of the write, we choose a BloomFilter of a more appropriate
> size when the file is finally written.
> I want to implement this feature and hope to get your opinions, thank you.
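
The sizing scheme in the quoted description could be sketched roughly as follows. This is an illustration under assumptions, not the parquet-mr implementation: `AdaptiveBloomFilterSketch` and `SimpleBloom` are hypothetical names, the filter is a classic bit-array Bloom filter rather than Parquet's split-block layout, candidate sizes are obtained by halving the maximum, and the final choice uses the textbook formula m = -n ln(p) / (ln 2)^2 with a fixed number of hash probes.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

class AdaptiveBloomFilterSketch {
    /** Minimal classic Bloom filter over precomputed 64-bit hashes (assumption: not split-block). */
    static final class SimpleBloom {
        static final int NUM_HASHES = 7; // fixed probe count, roughly optimal for fpp near 0.01
        final int numBits;
        final BitSet bits;

        SimpleBloom(int numBits) {
            this.numBits = numBits;
            this.bits = new BitSet(numBits);
        }

        // Double hashing: derive the i-th probe position from the two halves of the hash.
        private int index(long hash, int i) {
            int h1 = (int) hash;
            int h2 = (int) (hash >>> 32) | 1;
            return Math.floorMod(h1 + i * h2, numBits);
        }

        boolean mightContain(long hash) {
            for (int i = 0; i < NUM_HASHES; i++) {
                if (!bits.get(index(hash, i))) {
                    return false;
                }
            }
            return true;
        }

        void insert(long hash) {
            for (int i = 0; i < NUM_HASHES; i++) {
                bits.set(index(hash, i));
            }
        }
    }

    private final List<SimpleBloom> candidates = new ArrayList<>(); // ordered largest first
    private final double targetFpp;
    private long approxDistinct = 0;

    AdaptiveBloomFilterSketch(int maxBits, int minBits, double targetFpp) {
        if (minBits <= 0 || maxBits < minBits) {
            throw new IllegalArgumentException("require 0 < minBits <= maxBits");
        }
        this.targetFpp = targetFpp;
        for (int size = maxBits; size >= minBits; size /= 2) {
            candidates.add(new SimpleBloom(size));
        }
    }

    /** Inserts into every candidate; a miss in the largest one counts as a new distinct value. */
    void insert(long hash) {
        if (!candidates.get(0).mightContain(hash)) {
            approxDistinct++; // may undercount on false positives, never overcounts
        }
        for (SimpleBloom candidate : candidates) {
            candidate.insert(hash);
        }
    }

    long approxDistinctCount() {
        return approxDistinct;
    }

    /** Picks the smallest candidate that still has enough bits for the estimated distinct count. */
    SimpleBloom chooseFinal() {
        double requiredBits = -approxDistinct * Math.log(targetFpp) / (Math.log(2) * Math.log(2));
        SimpleBloom best = candidates.get(0); // fall back to the maximum size
        for (SimpleBloom candidate : candidates) {
            if (candidate.numBits >= requiredBits) {
                best = candidate; // later candidates are smaller
            }
        }
        return best;
    }
}
```

The trade-off in this sketch is extra memory and CPU during the write, since every candidate is kept and updated until the final one is chosen.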
--
This message was sent by Atlassian Jira
(v8.20.10#820010)