[jira] [Commented] (PARQUET-2254) Build a BloomFilter with a more precise size

Mars (Jira) Wed, 08 Mar 2023 08:09:13 -0800


    [ 
https://issues.apache.org/jira/browse/PARQUET-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17697987#comment-17697987
 ]


Mars commented on PARQUET-2254:
-------------------------------

[~wgtmac] [~gszadovszky] 
1) This Jira tracks building a BloomFilter with a more accurate size. 
The description gives my general idea; I will complete a first version.
2) PARQUET-2237 is used to optimize RowGroupFilter.
As [~gszadovszky] said, I am also unsure whether it is better to check the 
dictionary first or the bloom filter first.
But I think one thing is sure: if there is a dictionary, skipping the 
BloomFilter check will be better than the previous implementation.
The next step, the order in which the bloom filter and the dictionary are 
checked, needs more consideration.

3) In PARQUET-2237, I have a new idea now: if there are multiple filter 
predicates, for example connected by `OR`, we can optimize them one by one.
For example, for A > 3 OR B = 1:
if the statistics filter determines that A > 3 is impossible (drop) and 
B = 1 is mightMatch, the result of the statistics filter is mightMatch.
Then the dictionary filter only needs to compare B and can skip A, which 
improves performance.
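
The `OR` example above can be sketched in Python as follows. This is only an illustration, not parquet-mr code: `Verdict`, `evaluate_or`, and the two lookup tables are hypothetical names, and each stage stands in for a filter (statistics, dictionary) that returns drop or mightMatch for a single leaf predicate. The sketch assumes that a drop from any stage is definitive for that leaf, so the leaf does not need to be checked again by later stages.

```python
from enum import Enum


class Verdict(Enum):
    DROP = "drop"               # the row group cannot contain a match
    MIGHT_MATCH = "mightMatch"  # the row group may contain a match


def evaluate_or(leaves, stages):
    """Evaluate `leaf_1 OR leaf_2 OR ...` stage by stage.

    Each stage is a callable mapping a leaf predicate to a Verdict.
    Returns the final verdict and, per stage, the leaves that stage checked.
    """
    alive = list(leaves)
    checked = []
    for stage in stages:
        checked.append(list(alive))
        # Leaves that a stage drops are never passed to later stages.
        alive = [leaf for leaf in alive if stage(leaf) is Verdict.MIGHT_MATCH]
        if not alive:
            # Every branch of the OR is impossible: drop the row group.
            return Verdict.DROP, checked
    return Verdict.MIGHT_MATCH, checked


# Hypothetical per-stage verdicts for the row group in the example.
STATISTICS = {"A > 3": Verdict.DROP, "B = 1": Verdict.MIGHT_MATCH}
DICTIONARY = {"A > 3": Verdict.MIGHT_MATCH, "B = 1": Verdict.MIGHT_MATCH}

verdict, checked = evaluate_or(
    ["A > 3", "B = 1"],
    [STATISTICS.__getitem__, DICTIONARY.__getitem__],
)
print(verdict.value)  # mightMatch
print(checked)        # [['A > 3', 'B = 1'], ['B = 1']]
```

This pruning applies to `OR` only. For `AND`, a single dropped leaf already drops the whole row group, so there is nothing left to prune.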

> Build a BloomFilter with a more precise size
> --------------------------------------------
>
>                 Key: PARQUET-2254
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2254
>             Project: Parquet
>          Issue Type: Improvement
>            Reporter: Mars
>            Assignee: Mars
>            Priority: Major
>
> Currently the user specifies the size and then the BloomFilter is built. In 
> general scenarios, the number of distinct values is not known in advance. 
> If the BloomFilter can be sized automatically according to the data, the file 
> size can be reduced and the reading efficiency can also be improved.
> My idea is that the user specifies a maximum BloomFilter size, then we build 
> multiple BloomFilters at the same time and use the largest one as a counting 
> tool (if there is no hit when inserting a value, the counter is incremented; 
> this may be imprecise, but it is enough).
> At the end of the write, we choose a BloomFilter of a more appropriate size 
> and write it to the file.
> I want to implement this feature and hope to get your opinions, thank you.
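
A minimal Python sketch of the idea in the description, under several assumptions: it uses a textbook Bloom filter rather than Parquet's split-block format, the candidate sizes are halved from the user-supplied maximum, and the final choice uses the classic capacity formula n = m (ln 2)^2 / (-ln p) for m bits and target false-positive probability p. The names `SimpleBloom` and `AdaptiveBloomBuilder` are hypothetical and do not exist in parquet-mr.

```python
import hashlib
import math


class SimpleBloom:
    """Textbook Bloom filter; 7 hashes is roughly optimal for a 1% FPP."""

    def __init__(self, num_bits, num_hashes=7):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, value):
        # Double hashing: derive all positions from two 64-bit hashes.
        digest = hashlib.sha256(value.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def insert(self, value):
        """Insert value; return True if at least one bit was newly set."""
        is_new = False
        for pos in self._positions(value):
            byte, mask = pos // 8, 1 << (pos % 8)
            if not self.bits[byte] & mask:
                is_new = True
                self.bits[byte] |= mask
        return is_new

    def might_contain(self, value):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(value))

    def capacity(self, fpp):
        """Number of distinct values this filter can hold at the given FPP."""
        return int(self.num_bits * math.log(2) ** 2 / -math.log(fpp))


class AdaptiveBloomBuilder:
    """Builds several candidate filters and picks one when writing finishes."""

    def __init__(self, max_bytes, fpp=0.01, min_bytes=32):
        sizes = []
        size = max_bytes
        while size >= min_bytes:
            sizes.append(size)
            size //= 2
        self.fpp = fpp
        self.candidates = [SimpleBloom(s * 8) for s in sorted(sizes)]
        self.distinct_estimate = 0

    def insert(self, value):
        results = [c.insert(value) for c in self.candidates]
        # The largest filter doubles as an (imprecise) distinct counter.
        if results[-1]:
            self.distinct_estimate += 1

    def finish(self):
        # Smallest candidate whose capacity covers the estimate, else the largest.
        for candidate in self.candidates:
            if candidate.capacity(self.fpp) >= self.distinct_estimate:
                return candidate
        return self.candidates[-1]


builder = AdaptiveBloomBuilder(max_bytes=8192)
for _ in range(3):                      # repeated values must not be recounted
    for i in range(100):
        builder.insert(f"value-{i}")
chosen = builder.finish()
print(builder.distinct_estimate, len(chosen.bits))
```

The counter can only undercount, because a false positive in the largest filter hides a new value. The cost of this approach is that every candidate filter is kept in memory and updated on each insert until the write finishes.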



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
