On Dec 17, 2010, at 1:48 PM, Per Øyvind Karlsen wrote:
>
> So I guess there's something I'm not really fully grasping here...
>
> See code attached...
>
Yes. You miss that you need to estimate the expected size of the
population you wish to capture in a Bloom Filter:
size_t n = 0; /* <-- 0 will end up using an inappropriate default */
rpmbfParams(n, e, &m, &k);
WHat you need to do (if you wish to base your estimate on the "real world")
is something like this
rpm -qal | sort -u | wc -l
and then add 10% (which won't really matter but will make you feel better).
You likely also can get away with
static double e = 1.0e-4;
because (if you happen to hit a false positive in perl-URPM) its
not the end-of-the-world, just means that you will download
a package unnecessarily and add to transaction which will
be caught by rpmtsCheck() later. So "one-in-a-million" or
"one-in-a-billion" a priori estimates of an acceptable
probability for a false positive need to also look at
what happens with a "false positive" failure.
Also try adding
_rpmbf_debug = -1;
while getting up to speed on the parameter choices.
See also tdict.c test case in POPT using /usr/share/dict/words
that you ought to look at. It literally doesn't matter whether words
or filepaths are uses for set memember tokens, hashing chooses some
set of bits to flip to track set membership. The actual tokens used
don't matter a bit.
Your code is likely gonna need a per-package Bloom filter with NEVRA identifier,
because what perl-URPM need first is a list of what packages to download. As
written, you collapsing all paths in a repository into a single Bloom
filter. You need an array of per-package Bloom filters.
If you choose all per-package Bloom filters to be the same size (estimator
is ~10% larger than the size of the filelist in the package with the
most files, which should be something like 30-40K and with a probability
of false positive of ~0.01% (1.0e-4)), then you can use
rpmbfUnion/rpmbfIntersect
exactly the same as one would use OR and AND operations for boolean TRUE/FALSE.
So one can attempt a fancier bsearch "bisection" with better performance
down the road instead of a linear search to find which package contains
a given file path by looping over per-package Bloom Filters with identifiers
like NEVRA or URI. There's likely other performance wins that can be devised
since packages with a given prefix are likely to be the same end-point for
an unknown file path that is a continuation of the given prefix.
hth
73 de Jeff
> --
> Regards,
> Per Øvind
> <hdlist-bf.c>
______________________________________________________________________
RPM Package Manager http://rpm5.org
Developer Communication List [email protected]