kangpinghuang commented on issue #1738: add bloom filter index
URL: https://github.com/apache/incubator-doris/pull/1738#issuecomment-530704672
 
 
   > Besides implementation, I have several questions about the design
   > 
   > 1. Why do we choose to store one BF per column per row block (1024 rows)? 
This seems to have a fairly high storage and search overhead. For example, 
let's say a column contains 10M rows and thus 10K blocks. Under the default fpp 
which 5%, the size of each BF is around  800 bytes. So the total size of this 
BF index is ~8MB which is very large(for comparison, the uncompressed size of 
int column with 10M rows is 40MB). To evaluate a "in" predicate with `n` 
values, `10K` BFs need to be tested on `n` values, which could also take a long 
time.
   > 2. Size of BF depends on `expected_entries`, which is the number of unique 
values that will get inserted. Currently `expected_entries` is set to 
num_rows_per_block, which is too large for row block that contains many 
repeated values. How about allocating one BF for k distinct values instead of k 
rows? We can use the BF we're building to de-duplicate inputs.
   > 3. Classic BF (the one Doris is currently using) is known to be cache 
unfriendly. Have you considered BlockedBloomFilter which is used by Parquet and 
RocksDB? For more information on BlockedBloomFilter, you can refer to 
https://github.com/apache/parquet-format/blob/master/BloomFilter.md
   
   good suggestion! Tks

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to