yabola commented on PR #1023:
URL: https://github.com/apache/parquet-mr/pull/1023#issuecomment-1457669848

   @wgtmac @gszadovszky 
   I have a proposal to automatically build a BloomFilter with a more precise 
size. I created a Jira, https://issues.apache.org/jira/browse/PARQUET-2254, and I 
hope to get your opinions. Thank you.
   
   > Currently the user specifies the size up front and the BloomFilter is then 
built with it. In general, the number of distinct values is not known in advance.
   If the BloomFilter could be sized automatically according to the data, the 
file size could be reduced and read efficiency improved.
   
   My idea is that the user specifies a maximum BloomFilter size, and we build 
several candidate BloomFilters of different sizes at the same time. The largest 
BloomFilter doubles as a counting tool: if inserting a value finds no hit, the 
distinct-value counter is incremented by 1 (this may undercount because of false 
positives, but it is precise enough).
   At the end of the write, we choose the BloomFilter with the most appropriate 
size and write only that one to the file.
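
   The idea above can be sketched roughly as follows. This is only a toy illustration, not the parquet-mr API: `AdaptiveBloomSketch`, `SimpleBloom`, the halving of candidate sizes, and the `bitsPerValue` selection rule are all hypothetical names and assumptions made for the example.

```java
import java.util.BitSet;

/** Toy sketch of the proposal; none of these names come from parquet-mr. */
final class AdaptiveBloomSketch {

    /** Minimal Bloom filter over long values (hypothetical, for illustration only). */
    static final class SimpleBloom {
        static final int NUM_HASHES = 4;
        final int numBits; // assumed to be a power of two
        private final BitSet bits;

        SimpleBloom(int numBits) {
            this.numBits = numBits;
            this.bits = new BitSet(numBits);
        }

        /** Sets the bits for value; returns true if all of them were already set. */
        boolean insert(long value) {
            boolean alreadyPresent = true;
            for (int i = 0; i < NUM_HASHES; i++) {
                int idx = index(value, i);
                if (!bits.get(idx)) {
                    alreadyPresent = false;
                    bits.set(idx);
                }
            }
            return alreadyPresent;
        }

        boolean mightContain(long value) {
            for (int i = 0; i < NUM_HASHES; i++) {
                if (!bits.get(index(value, i))) return false;
            }
            return true;
        }

        /** Double hashing: the i-th probe is h1 + i * h2, with h2 forced odd. */
        private int index(long value, int i) {
            long h = mix(value);
            int h1 = (int) h;
            int h2 = ((int) (h >>> 32)) | 1;
            return Math.floorMod(h1 + i * h2, numBits);
        }

        /** SplitMix64-style mixing function. */
        private static long mix(long x) {
            x += 0x9e3779b97f4a7c15L;
            x = (x ^ (x >>> 30)) * 0xbf58476d1ce4e5b9L;
            x = (x ^ (x >>> 27)) * 0x94d049bb133111ebL;
            return x ^ (x >>> 31);
        }
    }

    private final SimpleBloom[] candidates; // candidates[0] is the largest
    private long estimatedDistinct = 0;

    /** Builds candidates of maxBits, maxBits/2, maxBits/4, ... bits (an assumed sizing scheme). */
    AdaptiveBloomSketch(int maxBits, int numCandidates) {
        candidates = new SimpleBloom[numCandidates];
        for (int i = 0; i < numCandidates; i++) {
            candidates[i] = new SimpleBloom(maxBits >> i);
        }
    }

    /** Inserts into every candidate; a miss in the largest one counts as a new distinct value. */
    void insert(long value) {
        if (!candidates[0].insert(value)) {
            estimatedDistinct++;
        }
        for (int i = 1; i < candidates.length; i++) {
            candidates[i].insert(value);
        }
    }

    long estimatedDistinct() {
        return estimatedDistinct;
    }

    /** Picks the smallest candidate with at least bitsPerValue bits per estimated distinct value. */
    SimpleBloom choose(int bitsPerValue) {
        long needed = estimatedDistinct * bitsPerValue;
        SimpleBloom best = candidates[0]; // fall back to the largest
        for (SimpleBloom c : candidates) {
            if (c.numBits >= needed && c.numBits < best.numBits) {
                best = c;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        AdaptiveBloomSketch sketch = new AdaptiveBloomSketch(1 << 16, 4);
        for (long v = 0; v < 1000; v++) {
            sketch.insert(v);
        }
        System.out.println("estimated distinct: " + sketch.estimatedDistinct());
        System.out.println("chosen size in bits: " + sketch.choose(10).numBits);
    }
}
```

   Because the counter only advances on a miss in the largest filter, it can undercount but never overcount. The cost is that every value is inserted into all candidates during the write, so the extra memory is bounded by roughly twice the maximum size when candidate sizes halve.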
   
   I want to implement this feature and


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
