Mars created PARQUET-2254:
-----------------------------

             Summary: Build a BloomFilter with a more precise size
                 Key: PARQUET-2254
                 URL: https://issues.apache.org/jira/browse/PARQUET-2254
             Project: Parquet
          Issue Type: Improvement
            Reporter: Mars

Currently the user has to specify the size up front, and the BloomFilter is then built at that size. In most scenarios the number of distinct values is not known in advance. If the BloomFilter could be sized automatically according to the data, the file size could be reduced and read efficiency could also be improved.

My idea is that the user specifies a maximum BloomFilter size, and we then build several BloomFilters of different sizes at the same time. The largest BloomFilter can double as a counting tool: if there is no hit when inserting a value, a counter is incremented by 1. This count may be imprecise, but it is accurate enough. At the end of the write, when the file is finalized, we choose the BloomFilter with the most appropriate size.
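
The proposal could be sketched roughly as follows. This is only an illustration under several assumptions, not the parquet-mr implementation: it uses a classic k-hash Bloom filter rather than Parquet's split-block layout, candidate sizes are obtained by halving the maximum, and the final choice is the smallest candidate whose estimated false-positive probability meets a target. The names SimpleBloomFilter, AdaptiveBloomFilterBuilder, expected_fpp, target_fpp and num_candidates are all hypothetical.

```python
import hashlib
import math


class SimpleBloomFilter:
    """Classic k-hash Bloom filter (an assumption; Parquet uses a split-block layout)."""

    def __init__(self, num_bits: int, num_hashes: int = 7):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, value: bytes):
        # Double hashing: derive k bit positions from two 64-bit hash halves.
        digest = hashlib.sha256(value).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:16], "little") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def insert(self, value: bytes) -> bool:
        """Set the value's bits; return True if at least one bit was newly set."""
        was_new = False
        for pos in self._positions(value):
            byte, mask = pos >> 3, 1 << (pos & 7)
            if not self.bits[byte] & mask:
                was_new = True
                self.bits[byte] |= mask
        return was_new

    def might_contain(self, value: bytes) -> bool:
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(value))


def expected_fpp(num_bits: int, num_hashes: int, ndv: int) -> float:
    """Standard false-positive estimate for ndv distinct values."""
    return (1.0 - math.exp(-num_hashes * ndv / num_bits)) ** num_hashes


class AdaptiveBloomFilterBuilder:
    """Builds several candidate filters and picks one when writing finishes."""

    def __init__(self, max_bytes: int, target_fpp: float = 0.01, num_candidates: int = 4):
        # Candidates ordered largest first: max, max/2, max/4, ...
        sizes = [(max_bytes * 8) >> i for i in range(num_candidates)]
        self.candidates = [SimpleBloomFilter(bits) for bits in sizes if bits >= 64]
        self.target_fpp = target_fpp
        self.distinct_estimate = 0

    def insert(self, value: bytes) -> None:
        for index, candidate in enumerate(self.candidates):
            was_new = candidate.insert(value)
            # The largest candidate doubles as an approximate distinct counter:
            # a miss on insert means the value has (probably) not been seen before.
            if index == 0 and was_new:
                self.distinct_estimate += 1

    def finish(self) -> SimpleBloomFilter:
        # Try the smallest candidate first; fall back to the largest.
        for candidate in reversed(self.candidates):
            fpp = expected_fpp(candidate.num_bits, candidate.num_hashes, self.distinct_estimate)
            if fpp <= self.target_fpp:
                return candidate
        return self.candidates[0]
```

A writer would call insert() for every value in the column chunk and finish() once at the end, serializing only the returned filter. The cost of this design is that every candidate is held in memory and updated on each insert, so the number of candidates trades write overhead against how closely the final size can match the data.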