[GitHub] [parquet-mr] yabola commented on pull request #1023: PARQUET-2237 Improve performance when filters in RowGroupFilter can match exactly

via GitHub Mon, 06 Mar 2023 23:15:08 -0800


yabola commented on PR #1023:
URL: https://github.com/apache/parquet-mr/pull/1023#issuecomment-1457669848


   @wgtmac @gszadovszky 
   I have a proposal to automatically build a BloomFilter with a more precise 
size. I created a Jira issue, https://issues.apache.org/jira/browse/PARQUET-2254, 
and I would appreciate your opinions. Thank you.
   
   > Currently, the user specifies the size up front and the BloomFilter is 
then built. In general scenarios, the number of distinct values is not known 
in advance.
   If the BloomFilter could be sized automatically according to the data, the 
file size could be reduced and read efficiency could also be improved.
   
   My idea is that the user specifies a maximum BloomFilter size, and we then 
build several BloomFilters at the same time. The largest BloomFilter can double 
as a counting tool: if there is no hit when a value is inserted, a counter is 
incremented by 1 (this may be imprecise, but it is good enough).
   At the end of the write, we choose the BloomFilter with the most appropriate 
size and write it to the file.
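
   The idea above could be sketched roughly as follows. This is only a 
minimal illustration, not the parquet-mr API: the class names 
`AdaptiveBloomFilterSketch` and `SimpleBloomFilter`, the halving of candidate 
sizes, the double-hashing scheme, and the `bitsPerValue` selection rule are all 
assumptions made for the example.

```java
import java.util.BitSet;

/** Hypothetical sketch of the proposal; none of these names exist in parquet-mr. */
class AdaptiveBloomFilterSketch {

    /** Minimal Bloom filter over 64-bit hashes, using double hashing. */
    static final class SimpleBloomFilter {
        final int numBits;
        final int numHashes;
        final BitSet bits;

        SimpleBloomFilter(int numBits, int numHashes) {
            if (numBits <= 0 || numHashes <= 0) {
                throw new IllegalArgumentException("numBits and numHashes must be positive");
            }
            this.numBits = numBits;
            this.numHashes = numHashes;
            this.bits = new BitSet(numBits);
        }

        /** Sets the bits for this hash; returns true if any bit was newly set (no hit). */
        boolean insert(long hash) {
            int h1 = (int) hash;
            int h2 = ((int) (hash >>> 32)) | 1; // odd step for double hashing
            boolean newlySet = false;
            for (int i = 0; i < numHashes; i++) {
                int idx = Math.floorMod(h1 + i * h2, numBits);
                if (!bits.get(idx)) {
                    bits.set(idx);
                    newlySet = true;
                }
            }
            return newlySet;
        }

        boolean mightContain(long hash) {
            int h1 = (int) hash;
            int h2 = ((int) (hash >>> 32)) | 1;
            for (int i = 0; i < numHashes; i++) {
                if (!bits.get(Math.floorMod(h1 + i * h2, numBits))) {
                    return false;
                }
            }
            return true;
        }
    }

    /** Candidates in ascending size; the last one has the user-specified maximum size. */
    final SimpleBloomFilter[] candidates;
    /** Approximate distinct count; false positives in the largest filter cause undercounting. */
    long distinctEstimate = 0;

    AdaptiveBloomFilterSketch(int maxBits, int numCandidates, int numHashes) {
        candidates = new SimpleBloomFilter[numCandidates];
        for (int i = 0; i < numCandidates; i++) {
            // Assumed sizing: each candidate is half the size of the next one.
            candidates[i] = new SimpleBloomFilter(maxBits >> (numCandidates - 1 - i), numHashes);
        }
    }

    /** Inserts into every candidate; the largest one doubles as the counter. */
    void insert(long value) {
        long hash = mix(value);
        int last = candidates.length - 1;
        for (int i = 0; i < last; i++) {
            candidates[i].insert(hash);
        }
        if (candidates[last].insert(hash)) {
            distinctEstimate++; // no hit in the largest filter: count a new value
        }
    }

    /** Picks the smallest candidate with enough bits per estimated distinct value. */
    SimpleBloomFilter choose(int bitsPerValue) {
        for (SimpleBloomFilter c : candidates) {
            if ((long) c.numBits >= distinctEstimate * bitsPerValue) {
                return c;
            }
        }
        return candidates[candidates.length - 1]; // fall back to the maximum size
    }

    /** SplitMix64-style finalizer, used here only as a stand-in hash function. */
    static long mix(long x) {
        x += 0x9E3779B97F4A7C15L;
        x = (x ^ (x >>> 30)) * 0xBF58476D1CE4E5B9L;
        x = (x ^ (x >>> 27)) * 0x94D049BB133111EBL;
        return x ^ (x >>> 31);
    }
}
```

   In this sketch the writer would call `insert` for every value and call 
`choose` once at the end of the write. Keeping all candidates in memory costs 
roughly twice the maximum size when the sizes halve, which is the trade-off for 
not knowing the distinct count up front.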
   
   I want to implement this feature and


