Github user ravipesala commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2324#discussion_r189647876
--- Diff:
datamap/bloom/src/main/java/org/apache/carbondata/datamap/bloom/BloomDataMapWriter.java
---
@@ -86,12 +86,31 @@ public void onBlockletStart(int blockletId) {
protected void resetBloomFilters() {
indexBloomFilters.clear();
List<CarbonColumn> indexColumns = getIndexColumns();
+ int[] stats = calculateBloomStats();
for (int i = 0; i < indexColumns.size(); i++) {
- indexBloomFilters.add(BloomFilter.create(Funnels.byteArrayFunnel(),
- bloomFilterSize, bloomFilterFpp));
+ indexBloomFilters
+ .add(new CarbonBloomFilter(stats[0], stats[1], Hash.MURMUR_HASH, compressBloom));
}
}
+ /**
+ * It calculates the bits size and number of hash functions to calculate bloom.
+ */
+ private int[] calculateBloomStats() {
+ /*
+ * n: how many items you expect to have in your filter
+ * p: your acceptable false positive rate
+ * Number of bits (m) = -n*ln(p) / (ln(2)^2)
+ * Number of hashes(k) = m/n * ln(2)
--- End diff --
Can't, as `k` is dependent on `m` (k = m/n * ln(2)).
---
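For reference, the two formulas quoted in the diff comment can be sketched as below. This is only an illustration of the dependency between `m` and `k`, not the code in the PR: the class name `BloomSizing`, the parameter names `expectedItems` and `fpp`, and the rounding and clamping choices are all assumptions.

```java
// Hypothetical helper illustrating the formulas from the diff comment.
// Not the PR's implementation; names and rounding choices are assumptions.
final class BloomSizing {
  private BloomSizing() {
  }

  /**
   * Returns {m, k}: the number of bits and the number of hash functions
   * for n expected items and an acceptable false positive rate p.
   */
  static int[] calculateBloomStats(long expectedItems, double fpp) {
    if (expectedItems <= 0) {
      throw new IllegalArgumentException("expectedItems must be positive");
    }
    if (!(fpp > 0.0 && fpp < 1.0)) {
      throw new IllegalArgumentException("fpp must be in (0, 1)");
    }
    double ln2 = Math.log(2);
    // Number of bits: m = -n * ln(p) / (ln(2)^2), rounded up and clamped to int.
    double bits = -expectedItems * Math.log(fpp) / (ln2 * ln2);
    int m = (int) Math.min(Integer.MAX_VALUE, Math.ceil(bits));
    // Number of hashes: k = m/n * ln(2). It is derived from m,
    // so m has to be computed first. At least one hash is always used.
    int k = Math.max(1, (int) Math.round((double) m / expectedItems * ln2));
    return new int[] {m, k};
  }
}
```

With these assumptions, `BloomSizing.calculateBloomStats(1000, 0.01)` yields roughly 9.6 bits per item and 7 hash functions.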