On Jun 21, 2012, at 12:51 AM, Alexey Tourbin wrote:

> On Thu, Jun 21, 2012 at 12:15 AM, Alexey Tourbin
> <alexey.tour...@gmail.com> wrote:
>> On Mon, Jun 18, 2012 at 10:32 PM, Jeffrey Johnson <n3...@me.com> wrote:
>>> Good: the above confirmation of the characteristics allows a set:versions
>>> implementation to proceed.
>> 
>> Hello, there's been some speculation about Bloom filters below, which
>> I cannot address right now, offhand. Nevertheless, I can say that, in
>> some highly mathematical sense, set-versions are exactly equivalent to
>> Bloom filters. They do just the same thing, if you will. The only
>> difference is that set-versions are more compact: they take somewhat
>> less space, which was, if you remember, the number one goal of the
>> original implementation.
> 
> More precisely, a set-version can be (in principle) converted to a
> Bloom filter which uses only one hash function. The idea is that such
> a filter will set bits in a highly sparse set of bits, one by one.
> Instead, a set-version remembers the bits simply by their indices.
> Setting the bit becomes taking the number, and there is a
> straightforward correspondence. It also turns out that a set of
> numbers can be easier to deal with, as opposed to a sparse set of
> bits.
> 

You've made an unsupported claim here re "numbers can be easier".
Presumably you are talking about means of encoding, where clever
mathematics derived from numerical transforms can be used to
remove redundancy. With a pile of bits in a Bloom filter all one
has to work with is a pile of bits.

The counter argument is this: testing bits (in a Bloom filter) is
rather KISS. Other compression/redundancy removal schemes
end up trading storage for implementation complexity cost.
In a running installer like RPM, there will always be a need
for memory dealing with payload contents. Since large amounts
of memory are eventually going to be needed/used, savings
by using Golomb-Reid codes are mostly irrelevant for the
100K -> 1M set sizes used by RPM no matter whether bits or
numbers are used to represent.

But don't take my argument as being opposed to set:versions whatsoever.
Just (for RPM and even for *.rpm) that compression size isn't as important
to me as it is to you as the "primary goal" of set: versions.

> If you want to know more why things have to work like this, and e.g.
> where constants pop up, there is a good starting point at "Cache-,
> Hash-, and Space-efficient Bloom filters" paper by Felix Putze, Peter
> Sanders, and Johannes Singler. Actually this paper helped me a lot to
> put things together and to produce the original implementation.
> Reading this paper requires some working mathematical knowledge,
> though. This requirement must not be underestimated, but also should
> not be overestimated. The paper is very readable: it tells you what
> you may want to do and what you have to do.
> http://algo2.iti.kit.edu/singler/publications/cacheefficientbloomfilters-wea2007.pdf

Thanks for the pointer. I actually read and studied this paper a few years back
trying to decide whether there was a need to attempt cache-line optimization
(but I totally ignored the paragraph that links Golomb codes to Bloom filters).

At the time I decided that the complexity cost to attempt cache-line 
optimization
wasn't a priority in RPM, but rather KISS simplicity while users considered
changes that involve a finite probability of false positive failures.

AFAICT set:versions also has a finite probability of failure due to (relatively 
rare)
hash collisions converting strings to numbers.

Is there any hope that set:versions (or the Golomb-Rice code) can supply a 
parameterization
to "tune" the probability of false positives?

Or am I mis-understanding what is implemented?

73 de Jeff
> ______________________________________________________________________
> RPM Package Manager                                    http://rpm5.org
> Developer Communication List                        rpm-devel@rpm5.org

______________________________________________________________________
RPM Package Manager                                    http://rpm5.org
Developer Communication List                        rpm-devel@rpm5.org

Reply via email to