On Dec 20, 2010, at 2:16 PM, Jeff Johnson wrote:

> 
> But remember there's a 16-32x overkill built into the original Bloom filter 
> parameters,
> so its pretty clear a Bloom filter is gonna start "winning" soon, based 
> solely on minimum
> size criteria, compared to a string representation.
> 
> Next will be to measure specific per-package parameterized Bloom filters with 
> the original e=0.0001
> false positive rate.
> 

Here's some slightly adjusted output to carry around per-package Bloom filters:

[...@rhel6 lib]$ ./tbf
       npkgs: 1221
    Dirnames: 722712 bytes (20886 items)
  with XZDIO: 62416 bytes
Bloom filter: false positives: 0.0001
Uncompressed: 62204 bytes
  with XZDIO: 59612 bytes

A per-packge Bloom filter has 16 extra bytes of overhead:
        "BFL3"          magic as in "Bloom filter using lookup3"
        n
        m
        k
followed by the bytes in the bit map.

About the only subtlety was ensuring that n >= 10 to avoid rpmbdParams()
from resetting n = 10000.

Since the XZ compressed output (including useful overhead) is already smaller 
than
the simplistic (i.e. no markup overhead), I'll leave the rest to perl-URPM to
figure out.

Note that _ALL_ file paths and versionless Provides: can be represented
in Bloom filters, and will often (because of the assumption n >= 10)
fit within the per-package Bloom filters I just described.

Note also that 16b of per-packge overhead is overkill as well. One could
easily cut the overhead in half simply by changing to "B3" for magic and
writing uint16_t rather than uint32_t ... hang on ... here's that result:

[...@rhel6 lib]$ ./tbf
       npkgs: 1221
    Dirnames: 722712 bytes (20886 items)
  with XZDIO: 62416 bytes
Bloom filter: false positives: 0.0001
Uncompressed: 62204 bytes
  with XZDIO: 59328 bytes

Hmmm hardly enough to worry about (assuming I got the lib/tbf.c code right).

For that matter, compression could likely be _ENTIRELY_ skipped because the 
"Uncompressed:"
no. of bytes (which is only the bitmaps, no per-package overhead) is approx. 
the same as
the XZ compressed size (there is some modest saving because of XZ compression).

hth

73 de Jeff


______________________________________________________________________
RPM Package Manager                                    http://rpm5.org
Developer Communication List                        [email protected]

Reply via email to