kangpinghuang commented on issue #1738: add bloom filter index URL: https://github.com/apache/incubator-doris/pull/1738#issuecomment-530704672 > Besides implementation, I have several questions about the design > > 1. Why do we choose to store one BF per column per row block (1024 rows)? This seems to have a fairly high storage and search overhead. For example, let's say a column contains 10M rows and thus 10K blocks. Under the default fpp which 5%, the size of each BF is around 800 bytes. So the total size of this BF index is ~8MB which is very large(for comparison, the uncompressed size of int column with 10M rows is 40MB). To evaluate a "in" predicate with `n` values, `10K` BFs need to be tested on `n` values, which could also take a long time. > 2. Size of BF depends on `expected_entries`, which is the number of unique values that will get inserted. Currently `expected_entries` is set to num_rows_per_block, which is too large for row block that contains many repeated values. How about allocating one BF for k distinct values instead of k rows? We can use the BF we're building to de-duplicate inputs. > 3. Classic BF (the one Doris is currently using) is known to be cache unfriendly. Have you considered BlockedBloomFilter which is used by Parquet and RocksDB? For more information on BlockedBloomFilter, you can refer to https://github.com/apache/parquet-format/blob/master/BloomFilter.md good suggestion! Tks
---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] With regards, Apache Git Services --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
