Is there a good aspect to reporting false positives? On Mon, Nov 2, 2015 at 10:31 AM, 'Jon Hough' via Programming < programm...@jsoftware.com> wrote:
> > I thought I would have a go at creating a very simple bloom filter in J. I > used a boolean array as the backing data structure. It seems to work ok: > Bloom filters are nice data structures for holding lossy information about > items you stick in them. You can't retrieve the items but you can perform a > test to see if the item has been placed into the bloom filter previously. > An interesting property is the imperfect nature of the test. It will return > false positives sometimes, items that were not placed into the bloom filter > but the test returns a positive result anyway. > One use I know of is for spell checkers. You don't need to have an actual > array / map of correctly spelled words - just a bloom filter that you can > test new words against. Anyway, this is my implementation (the hash > function is just a simple test function - probably better to use MD5 or > murmur or something). > > > NB. bloom filter > > filtersize=: 100 > keysize=:50 > > NB. the bloom filter expressed as a boolean array > BF=:filtersize $ 0 > > NB. silly hash -- used for simple testing > hash=: > filtersize&|@:(29211&XOR)@:((3020&AND)+(1091&XOR))@:(+/)@:(a.&i.)@:": > > lrhash =: hash@:(":,":) > > > NB. add an item to the bloom filter. > NB. Create an array of size 'keysize' > NB. and do multiple hashing for each > NB. item. These hashes are the indices > NB. to be inserted into the bloom filter. > additem=: 3 : 0 > l=:(hash y),((<:keysize)$0) > indices=:lrhash/\ l > BF =:1 indices} BF > ) > > > NB. Test if y is an item in the bloom filter. > NB. Possibility of false positive. > contains=: 3 : 0 > l=:(hash y),((<:keysize)$0) > indices=:lrhash/\ l > NB. if a zero exists then this item > NB. was not added to the bloom filter. > -.(0 e. indices{BF) > > ) > > > Add some items to the bloom filter, animal names. > > NB. Add some items ot the filter. > additem&.> > ('dog';'cat';'bird';'octopus';'horse';'cow';'mouse';'chicken';'salmon';'hamster';'wolf';'elephant';'pig';'sheep';'zebra') > > and we can test words: > > contains 'dog' > 1 > > contains 'human' > 1 > > Oh, false positive. 'human' isn't in there. > > contains 'Dog' > 0 > > Case sensitive, obviously. > > ---------------------------------------------------------------------- > For information about J forums see http://www.jsoftware.com/forums.htm > -- Devon McCormick, CFA Quantitative Consultant ---------------------------------------------------------------------- For information about J forums see http://www.jsoftware.com/forums.htm