Hi all,
I've been using BloomFilters for various tasks, and I can't shake the
feeling that they could be of some use in Lucene internals, to speed up
various membership tests, especially if we look for 100% correct
negatives, and we can accept a small rate of false positives.
For example, let's consider the recent discussion about deleted docs. If
we maintained a simple BloomFilter for deleted docs (per segment), then
we could say for sure when a document is _not_ deleted. Since segments
usually contain small percentage of deleted docs this should
tremendously speed up the check. If the test returns true, then we need
to actually look up the id in the list of deleted docs (because of a
small chance of false positives - but this perhaps can be acceptable in
some situations?).
Another half-formed idea is to somehow use them in Filter-s, to save on
space and perhaps on lookup time, at the cost of some false positives.
I suspect there could be other places where we can use this structure.
It doesn't have to be anything sophisticated or large - in some contexts
it's perhaps enough to have a BloomFilter consisting of a single (long)
value, plus a trivial hashing scheme for setting and probing bits in
that long value.
Sorry if this sounds too vague, it's just some food for thought ...
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org