On Wed, Jun 8, 2011 at 1:05 PM, Raymond Hettinger <pyt...@rcn.com> wrote: > On Jun 6, 10:47 am, geremy condra <debat...@gmail.com> wrote: >> On Fri, Jun 3, 2011 at 1:17 PM, Raymond Hettinger <pyt...@rcn.com> wrote: >> > Thanks for all the feedback on the earlier post. >> >> > I've updated the recipe to use a cleaner API, simpler code, >> > more easily subclassable, and with optional optimizations >> > for better cache utilization and speed: >> >> > http://code.activestate.com/recipes/577684-bloom-filter/ >> >> Any chance of this landing in collections? > > I would be happy to add something like this to collections > once the API matures. > > Right now, the constructor expects the user to know the desired > memory size and number of probes. Those could be computed > from the maximum number of unique keys and the tolerable > error false positive rate. > > The most convenient constructor could be: > b = BloomFilter(myseq) > where len(myseq) is presumed indicate the maximum number of > unique entries and there is a default tolerable false positive > rate of 1 in 10,000. > > The API already lets the user substitute different methods > for get_probes(). It should also allow the bytearray to be > easily replaced by other objects with indexable access > (array objects, mmap objects, etc). > > We could also provide union, intersection, and difference > operations but these would be a little awkward for general > purpose use because the result is only meaningful when > the sizes, probe functions, and input domain are all the same. > > Fortunately, there is a lot of prior art, so the API design > can be informed by what has worked well for others.
Seems like the code that Dan Stromberg posted addresses a lot of this. Thoughts on his code? Also, if you're interested I'm about halfway through writing a C implementation, although it would have to be somewhat less flexible to get a real speed advantage. Right now there's a hefty enough speed penalty to the filter (the states test takes about 150 times as long with a bloom filter as with a normal set) that it may make sense to have a base case that uses an optimized C implementation but that can be overridden as needed. Geremy Condra -- http://mail.python.org/mailman/listinfo/python-list