On May 1, 2009, at 9:35 AM, Matthew Toseland wrote:
> THE NUMBERS:
>
> The salted-hash datastore uses by default 1/2048th of the store size for the Bloom filters. This works out to a 23-element filter with 2-bit counting, i.e. 46 bits per key in the store (SSK, CHK, pubkey; store or cache). The easiest thing would be to reduce this to a 1-bit filter, cutting its size in half. Also we don't need the pubkey store. So the filter for our peers would be approximately 1/6000th of the size of our datastore. This gives 18MB per 100GB of datastore.
>
> We could halve this again, but it would mean that the average failing request is diverted 2.5 times to nodes which might have the data but don't (24 hops * 20 peers = 480 Bloom checks, vs 0.005% false positives); each time the request has to wait.
>
> So if we have 20 peers each with a 500GB store, that gives 1.8GB of Bloom filters! On the other hand, if we have 20 peers each with a 10GB store, that gives 36MB of Bloom filters. Clearly the latter is acceptable; the former probably isn't for most nodes.
>
> Also note that for acceptable performance it is essential that the Bloom filters for all our online peers fit into RAM. What we are going to have to do is set an upper limit on the total memory we can use for Bloom filters, and drop the biggest filters until we reach the target.
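(As an aside, the cap-and-drop policy at the end could be as simple as the sketch below. The class names, fields and the 128MB cap are all my own assumptions, not anything in the current code:)

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Comparator;
    import java.util.List;

    // Sketch of "cap total filter memory, drop the biggest filters first".
    class PeerFilter {
        final long sizeBytes; // roughly storeSizeBytes / 6000, per the estimate above
        PeerFilter(long sizeBytes) { this.sizeBytes = sizeBytes; }
    }

    class BloomBudget {
        static final long MAX_FILTER_BYTES = 128L * 1024 * 1024; // hypothetical cap

        /** Returns the subset of peer filters we keep in RAM. */
        static List<PeerFilter> applyCap(List<PeerFilter> filters) {
            List<PeerFilter> kept = new ArrayList<PeerFilter>(filters);
            // Sort biggest-first, so the biggest are the first to be dropped.
            Collections.sort(kept, new Comparator<PeerFilter>() {
                public int compare(PeerFilter a, PeerFilter b) {
                    return Long.compare(b.sizeBytes, a.sizeBytes);
                }
            });
            long total = 0;
            for (PeerFilter f : kept) total += f.sizeBytes;
            while (total > MAX_FILTER_BYTES && !kept.isEmpty()) {
                total -= kept.remove(0).sizeBytes; // drop the current biggest
            }
            return kept;
        }
    }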
Am I missing something? That's not what I get, referencing...
http://en.wikipedia.org/wiki/Bloom_filter
"In theory, an optimal data structure equivalent to a counting Bloom
filter should not use more space than a static Bloom filter."
We have...
10 GB store => 333k keys (the elements to be put in the filter; "n" in the article)
The article says 9.6 bits/element (with good hash functions) will yield a 1% error rate.
9.6 * 333*10^3 = 3,196,800 bits ~= 3 Mbits ~= 390kB (the filter size; "m" in the article)
n = 333,333
m = 3,196,800
That only leaves "k", the number of hash functions, i.e. the number of bits set per inserted key. The article says: "Classic Bloom filters use 1.44log2(1 / ε) bits of space per inserted key, where ε is the false positive rate", and the optimal k works out to log2(1/ε).
So your 0.005% yields ~14 bits set per key:
k = log2( 1/(0.005*0.01) ) ~= 14.3
at a cost of 1.44 * 14.3 ~= 20.6 bits of space per key.
Or @ 1% => k ~= 7 bits set per key, at the 9.6 bits of space per key above.
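(The same arithmetic as a throwaway sketch; nothing Freenet-specific, and the class/method names are made up:)

    // Optimal Bloom filter parameters for n keys at false-positive rate eps:
    //   bits set per key (hash functions): k = log2(1/eps)
    //   bits of space per key:             1.44 * log2(1/eps)
    class BloomMath {
        static void show(long n, double eps) {
            double log2InvEps = Math.log(1.0 / eps) / Math.log(2.0);
            long k = Math.round(log2InvEps);            // hash functions
            long m = Math.round(n * 1.44 * log2InvEps); // total filter bits
            System.out.printf("n=%d eps=%g -> k=%d, m=%d bits (~%.0f kB)%n",
                    n, eps, k, m, m / 8192.0);
        }

        public static void main(String[] args) {
            show(333333L, 0.01);    // 1%:     k=7,  ~3.2 Mbits, ~390 kB
            show(333333L, 0.00005); // 0.005%: k=14, ~6.9 Mbits, ~840 kB
        }
    }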
Some questions though...
It still seems to me that all this is moving away from darknet routing, and that we will be sending our peers "updates" on the new things we've acquired in our datastores... Is this adding an optimization to the routing algorithm, or replacing it?
Are you recommending that we use variable-size Bloom filters (based on datastore size)?
It seems to me that a node could have enough logic to determine whether a static-sized Bloom filter is "too full" (if, say, more than 50-75% of its bits are set at update time), and simply not count it (the node's datastore being too big for the Bloom filter to be useful).
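(Something like this; the threshold and names are made up:)

    import java.util.BitSet;

    // Sketch: treat a peer's advertised filter as unusable once it is
    // mostly ones, i.e. too full to exclude much of anything.
    class FilterSanity {
        static final double MAX_FILL_RATIO = 0.75; // arbitrary cut-off

        static boolean usable(BitSet filter, int filterBits) {
            double fill = (double) filter.cardinality() / filterBits;
            return fill <= MAX_FILL_RATIO;
        }
    }

The same check would incidentally reject the deliberate all-1's filter I ask about next, though a peer could still pad its filter to just under whatever threshold we pick.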
For that matter, could a peer manipulate its advertised filter to get more requests and thus monitor activity? (e.g. full-set/all-1's)
--
Robert Hailey