[ https://issues.apache.org/jira/browse/PARQUET-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17702456#comment-17702456 ]
ASF GitHub Bot commented on PARQUET-2254:
-----------------------------------------

yabola commented on PR #1042:
URL: https://github.com/apache/parquet-mr/pull/1042#issuecomment-1475678398

Five candidates are set by default, and 2M of storage is occupied by default (the previous default was 1M):
`[expectedNDV=54000, numBytes=65536], [expectedNDV=108000, numBytes=131072], [expectedNDV=216500, numBytes=262144], [expectedNDV=433000, numBytes=524288], [expectedNDV=866000, numBytes=1048576]`

> Build a BloomFilter with a more precise size
> --------------------------------------------
>
>                 Key: PARQUET-2254
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2254
>             Project: Parquet
>          Issue Type: Improvement
>            Reporter: Mars
>            Assignee: Mars
>            Priority: Major
>
> h3. Why are the changes needed?
> Currently, using a bloom filter requires specifying the NDV (number of distinct values) up front, and the BloomFilter is then built for that size. In many scenarios the number of distinct values is not known in advance.
> If the BloomFilter can be sized automatically from the data, the file size can be reduced and read efficiency can also be improved.
> h3. What changes were proposed in this pull request?
> {{DynamicBlockBloomFilter}} holds multiple {{BlockSplitBloomFilter}} candidates and inserts each value into all candidates at the same time. The largest bloom filter is used as an approximate deduplication counter, and candidates that become too small for the observed number of distinct values are removed during data insertion.
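For context, the expectedNDV/numBytes pairs in the comment above are consistent with the standard Bloom filter sizing formula, bits = -n * ln(p) / (ln 2)^2, assuming the default false-positive probability of 0.01 and power-of-two byte sizes. The following is a minimal sketch of that relationship only; the class and method names are illustrative and are not the parquet-mr API:

```java
// Illustrative only: how a candidate's numBytes relates to its expectedNDV.
// Uses the classic Bloom filter sizing bound; names are hypothetical.
public final class BloomFilterSizingSketch {

  /** bits = -n * ln(p) / (ln 2)^2 */
  static long optimalNumOfBits(long ndv, double fpp) {
    return (long) Math.ceil(-ndv * Math.log(fpp) / (Math.log(2) * Math.log(2)));
  }

  /** Inverse direction: the approximate NDV a filter of numBytes can hold at the given fpp. */
  static long maxNdvForBytes(long numBytes, double fpp) {
    long bits = numBytes * 8L;
    return (long) (bits * Math.log(2) * Math.log(2) / -Math.log(fpp));
  }

  public static void main(String[] args) {
    double fpp = 0.01; // assumed default false-positive probability
    for (long bytes : new long[] {65536, 131072, 262144, 524288, 1048576}) {
      // Prints roughly 54k, 109k, 218k, 437k, 875k -- close to the candidate list above.
      System.out.printf("numBytes=%d -> expectedNDV ~ %d%n", bytes, maxNdvForBytes(bytes, fpp));
    }
  }
}
```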
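The candidate-pruning scheme described in the proposal can be sketched roughly as below. This is a simplified illustration under assumptions, not the actual `DynamicBlockBloomFilter` code: all class and method names are invented, the inner filter is a plain bit-array stand-in for `BlockSplitBloomFilter`, and the "approximate deduplication counter" is modeled by probing the largest candidate before each insert.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Simplified sketch of the candidate-pruning idea described above.
// Not the parquet-mr DynamicBlockBloomFilter; all names are invented.
public class AdaptiveBloomFilterSketch {

  /** Minimal bit-array Bloom filter with 8 probes (stand-in for BlockSplitBloomFilter). */
  static final class SimpleFilter {
    private final long[] words;
    private final long bitSize;

    SimpleFilter(int numBytes) {
      this.words = new long[numBytes / 8];
      this.bitSize = numBytes * 8L;
    }

    private int bitIndex(long hash, int probe) {
      return (int) Math.floorMod(hash * (2L * probe + 1), bitSize);
    }

    void insert(long hash) {
      for (int p = 0; p < 8; p++) {
        int i = bitIndex(hash, p);
        words[i >>> 6] |= 1L << (i & 63);
      }
    }

    boolean mightContain(long hash) {
      for (int p = 0; p < 8; p++) {
        int i = bitIndex(hash, p);
        if ((words[i >>> 6] & (1L << (i & 63))) == 0) {
          return false;
        }
      }
      return true;
    }
  }

  /** One candidate: a filter paired with the NDV it was sized for. */
  static final class Candidate {
    final long expectedNdv;
    final SimpleFilter filter;

    Candidate(long expectedNdv, int numBytes) {
      this.expectedNdv = expectedNdv;
      this.filter = new SimpleFilter(numBytes);
    }
  }

  private final List<Candidate> candidates = new ArrayList<>();
  private final Candidate largest;   // doubles as an approximate distinct-value counter
  private long approxDistinct = 0;

  AdaptiveBloomFilterSketch(long[] ndvs, int[] numBytes) {
    for (int i = 0; i < ndvs.length; i++) {   // candidates ordered smallest to largest
      candidates.add(new Candidate(ndvs[i], numBytes[i]));
    }
    largest = candidates.get(candidates.size() - 1);
  }

  /** Insert a pre-computed hash into every surviving candidate. */
  void insertHash(long hash) {
    // Approximate dedup counting: if the largest candidate has not seen this
    // hash yet, treat it as a new distinct value.
    if (!largest.filter.mightContain(hash)) {
      approxDistinct++;
    }
    for (Candidate c : candidates) {
      c.filter.insert(hash);
    }
    // Drop candidates that are now too small for the observed cardinality,
    // always keeping the largest as a fallback.
    Iterator<Candidate> it = candidates.iterator();
    while (it.hasNext()) {
      Candidate c = it.next();
      if (c != largest && c.expectedNdv < approxDistinct) {
        it.remove();
      }
    }
  }

  /** After all inserts, the smallest surviving candidate would be written out. */
  Candidate chosen() {
    return candidates.get(0);
  }
}
```

Under this sketch, the filter that is finally written is the smallest candidate whose expected NDV was never exceeded, which is where the file-size saving described in the issue comes from.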