markharw00d wrote:
Andrzej Bialecki wrote:
Funny, I was having vague thoughts about this today too having been
concerned about some of the big arrays that can end up in a typical
Lucene app. Aside from providing space-efiicient lookups, another
application for BloomFilters is in similarity measures e.g. ANDing 2
BloomFilters can provide the basis for a rough measure of similarity.
Yes, it's an intriguing thought - worth checking how well it works in
practice.
There are some other cool things you can do with BloomFilter-s:
* a counting BloomFilter allows both adding and deleting the keys, and
the estimation of frequency of a key (if the same key is added multiple
times).
* spectral BloomFilters provide a way to store a (small) exact value
within a filter, so that when you test the membership you can also
retrieve the value corresponding to that key (perhaps good for idf?).
This could be potentially useful in many places, but one specific
application is outlined here: http://markmail.org/message/xyqdz3go6jwu4ozm
* if you use BloomFilter just for quick lookup, but you want to
eliminate false positives, it's possible to probe the main filter using
all keys from the keyspace, and build a small table of exceptions that
eliminates false positives from the first filter.
* dynamic BloomFilter-s don't require that you know the number of keys
in advance. This could be useful e.g. to quickly eliminate nonexistent
terms from a query, without actually looking up the query terms in the
dictionary - just maintain a dynamic BloomFilter of all terms in the index.
* another thought: we could create a BloomFilter from all terms in a
document, and store it as a field in a document. I don't know yet what
for ... :) well, are there any scenarios when you test the document
whether it contains a list of words? perhaps spam/porn detection.
Keep the ideas coming ... I have a feeling that eventually something
useful will come out of it.
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]