[ https://issues.apache.org/jira/browse/PARQUET-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17702287#comment-17702287 ]
ASF GitHub Bot commented on PARQUET-2254:
-----------------------------------------

yabola commented on PR #1042:
URL: https://github.com/apache/parquet-mr/pull/1042#issuecomment-1475247698

   https://github.com/apache/parquet-mr/blob/1235003e742e6a76bf6cb8f7ed33e942fa12d0d5/parquet-column/src/main/java/org/apache/parquet/column/values/bloomfilter/BlockSplitBloomFilter.java#L197-L219


> Build a BloomFilter with a more precise size
> --------------------------------------------
>
>                 Key: PARQUET-2254
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2254
>             Project: Parquet
>          Issue Type: Improvement
>            Reporter: Mars
>            Assignee: Mars
>            Priority: Major
>
> Currently, the user specifies the size up front and the BloomFilter is then
> built to that size. In typical scenarios, however, the number of distinct
> values is not known in advance.
> If the BloomFilter could be sized automatically according to the data, the
> file size could be reduced and read efficiency could also be improved.
> My idea is that the user specifies a maximum BloomFilter size, and we then
> build multiple BloomFilters at the same time. The largest BloomFilter can
> double as a counting tool (if a value is not already a hit when it is
> inserted, a counter is incremented; this may be imprecise, but it is good
> enough).
> Then, at the end of the write, we choose a BloomFilter of a more appropriate
> size to write to the file.
> I want to implement this feature and hope to get your opinions. Thank you.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
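
The idea in the quoted issue description (several candidate filters filled in parallel, the largest one doubling as an approximate distinct counter, and a final pick at the end of the write) could be sketched roughly as below. This is only an illustration under stated assumptions, not the code in PR #1042 and not the parquet-mr API: `AdaptiveBloomSketch`, `SimpleBloom`, `mix`, `choose` and the textbook false-positive estimate are all hypothetical names and choices, and a plain `BitSet` filter stands in for `BlockSplitBloomFilter`.

```java
import java.util.Arrays;
import java.util.BitSet;

/** Hypothetical sketch of the "build several, keep one" idea; not parquet-mr code. */
final class AdaptiveBloomSketch {

    /** Minimal Bloom filter over 64-bit hashes, probing k bits via double hashing. */
    static final class SimpleBloom {
        private final BitSet bits;
        private final int numBits;
        private final int numHashes;

        SimpleBloom(int numBits, int numHashes) {
            this.bits = new BitSet(numBits);
            this.numBits = numBits;
            this.numHashes = numHashes;
        }

        int numBits() { return numBits; }

        /** Sets the probe bits; returns true if at least one bit was newly set. */
        boolean insert(long hash) {
            int h1 = (int) hash;
            int h2 = (int) (hash >>> 32) | 1; // odd step for double hashing
            boolean changed = false;
            for (int i = 0; i < numHashes; i++) {
                int idx = Math.floorMod(h1 + i * h2, numBits);
                if (!bits.get(idx)) {
                    bits.set(idx);
                    changed = true;
                }
            }
            return changed;
        }

        boolean mightContain(long hash) {
            int h1 = (int) hash;
            int h2 = (int) (hash >>> 32) | 1;
            for (int i = 0; i < numHashes; i++) {
                if (!bits.get(Math.floorMod(h1 + i * h2, numBits))) return false;
            }
            return true;
        }

        /** Textbook estimate (1 - e^(-k*n/m))^k for n distinct inserted values. */
        double expectedFpp(long n) {
            return Math.pow(1.0 - Math.exp(-(double) numHashes * n / numBits), numHashes);
        }
    }

    private final SimpleBloom[] candidates; // sorted by size, largest last
    private long distinctEstimate = 0;

    AdaptiveBloomSketch(int[] candidateBits, int numHashes) {
        int[] sorted = candidateBits.clone();
        Arrays.sort(sorted);
        candidates = new SimpleBloom[sorted.length];
        for (int i = 0; i < sorted.length; i++) {
            candidates[i] = new SimpleBloom(sorted[i], numHashes);
        }
    }

    /** SplitMix64 finalizer, used here only to spread the input values. */
    static long mix(long x) {
        x = (x ^ (x >>> 30)) * 0xbf58476d1ce4e5b9L;
        x = (x ^ (x >>> 27)) * 0x94d049bb133111ebL;
        return x ^ (x >>> 31);
    }

    /** Inserts into every candidate; a miss in the largest filter bumps the counter. */
    void insert(long value) {
        long hash = mix(value);
        if (candidates[candidates.length - 1].insert(hash)) {
            distinctEstimate++; // may undercount on false positives, never overcounts
        }
        for (int i = 0; i < candidates.length - 1; i++) {
            candidates[i].insert(hash);
        }
    }

    long distinctEstimate() { return distinctEstimate; }

    /** Smallest candidate whose estimated FPP meets the target, else the largest. */
    SimpleBloom choose(double targetFpp) {
        for (SimpleBloom c : candidates) {
            if (c.expectedFpp(distinctEstimate) <= targetFpp) return c;
        }
        return candidates[candidates.length - 1];
    }
}
```

In this sketch a writer would call `insert` for every value in the column chunk and call `choose` once at the end, serializing only the returned filter. The memory cost during the write is the sum of all candidate sizes, which is the trade-off the proposal accepts in exchange for a smaller filter on disk.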