Hi all,

I've been using BloomFilters for various tasks, and I can't shake the feeling that they could be of some use in Lucene internals, to speed up various membership tests, especially if we look for 100% correct negatives, and we can accept a small rate of false positives.

For example, let's consider the recent discussion about deleted docs. If we maintained a simple BloomFilter for deleted docs (per segment), then we could say for sure when a document is _not_ deleted. Since segments usually contain small percentage of deleted docs this should tremendously speed up the check. If the test returns true, then we need to actually look up the id in the list of deleted docs (because of a small chance of false positives - but this perhaps can be acceptable in some situations?).

Another half-formed idea is to somehow use them in Filter-s, to save on space and perhaps on lookup time, at the cost of some false positives.

I suspect there could be other places where we can use this structure. It doesn't have to be anything sophisticated or large - in some contexts it's perhaps enough to have a BloomFilter consisting of a single (long) value, plus a trivial hashing scheme for setting and probing bits in that long value.

Sorry if this sounds too vague, it's just some food for thought ...

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org

Reply via email to