On Sat, Jun 23, 2012 at 1:55 AM, Jeffrey Johnson <n3...@me.com> wrote:
> I would state that compression (of any sort) to minimize
> bandwidth is entirely the wrong problem to solve.

So what kind of problem are we trying to solve? Why all these little
jabs? Are you ready to go out and proclaim "I, Jeffrey Johnson, am
entirely confident that whatever problem we face must be addressed,
but minimizing bandwidth is entirely wrong and shall never be
addressed!"? :-)

Why is it wrong to minimize bandwidth, or, in other words, why is it
bad to spend less money? Your answer amounts to saying that the
meaning of life is not to spend less money, which is the wrong
perspective. Okay, but what is a better perspective? Spending more
money for less good is certainly not a better perspective.

Okay, but what are we actually trying to do? Well, we try to provide
some binary compatibility by using a limited amount of information
with some contrived data structures which test the set-subset
relation. If space were not a problem, we could simply plug ELF
sections into RPM headers, couldn't we? Why not? And why is there a
distinction between the header and the payload? :-)
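To make the set-subset idea concrete, here is a minimal sketch; the
names, hash function, and widths are illustrative assumptions, not
the actual set:versions implementation. Each provided and required
ELF symbol is hashed to a fixed-width integer, and dependency
checking becomes a subset test on the sorted hash sets:

```python
import hashlib

def symbol_hashes(symbols, bits=20):
    """Hash each symbol name into a bits-wide integer (illustrative)."""
    mask = (1 << bits) - 1
    return sorted({int.from_bytes(hashlib.md5(s.encode()).digest()[:4], "big") & mask
                   for s in symbols})

def is_subset(required, provided):
    """Merge-style subset test on two sorted lists of hashes."""
    it = iter(provided)
    cur = next(it, None)
    for r in required:
        # Skip provided hashes smaller than the one we are looking for.
        while cur is not None and cur < r:
            cur = next(it, None)
        if cur != r:
            return False
    return True

provides = symbol_hashes(["foo", "bar", "baz"])
requires = symbol_hashes(["foo", "baz"])
assert is_subset(requires, provides)
```

The merge-style scan is what makes the "set of numbers" cheap: one
linear pass over two sorted lists, no per-symbol string comparison.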

> My contrarian POV should not be taken as opposed to compression
> or elegance or anything else. Just that minimal size in the
> representation of bit fields (or bit positions as numbers)
> overly limits the applicability of set:versions.
>
> There are lots of usage cases for efficient sub-set computations
> in package management, not just as a de facto API/ABI check
> using ELF symbols. Most of the other usage cases for efficient
> sub-set computations are not subject to an ultimately optimal
> string encoding.

My POV is that I ask "how best can I do what I need to do, is it
doable, what is the price, can it be reduced, etc.". Of course, these
"best to do" and "need to do" terms are not exactly mathematical, and
I can already hear some laughs in the audience. Nevertheless, the
only thing you can oppose to this is simply a better implementation.
Anything else does not count. Why? Just because! Go tell people they
should spend more money. ;-)

> E.g. ripping off the base62 encoding and distributing binary
> assertion fields saves at least as much bandwidth as your
> guesstimate that a Golomb-Rice code is ~20% more efficient
> than a Bloom filter.

I see no point in criticizing the base62 encoding for being
suboptimal compared to raw bytes, because it does exactly what it
says: it squeezes bits into alnum characters. It is obvious, and
beside the point, that raw bytes can store more bits. The goal was
just that: to "dump" bits into a "readable" and "usable" form.
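For the record, here is a generic base62 integer codec; this shows
the general technique, not the actual set:versions encoder (whose
alphabet ordering may differ). Each alnum character carries
log2(62) ≈ 5.95 bits versus 8 bits for a raw byte, and that is the
whole trade-off:

```python
import string

# 0-9, a-z, A-Z: 62 printable characters, log2(62) ~= 5.95 bits each.
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase

def b62encode(n):
    """Encode a non-negative integer as a base62 string."""
    if n == 0:
        return ALPHABET[0]
    out = []
    while n:
        n, r = divmod(n, 62)
        out.append(ALPHABET[r])
    return "".join(reversed(out))

def b62decode(s):
    """Decode a base62 string back to an integer."""
    n = 0
    for c in s:
        n = n * 62 + ALPHABET.index(c)
    return n

assert b62decode(b62encode(123456789)) == 123456789
```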

> Just in case: yes acidic sarcasm was fully intended.

There is no point in your sarcasm, except perhaps to show that you
are very clever (which I fully agree with).

> You already made a valid point asking what probability means wrto
> false positives. In fact your 2**10/2**20 is a very different
> estimate of false positive probability than the approx. currently
> in use in rpmbf.c (which is based on a statistical model for
> Bloom filter parameters).

The 2**10/2**20 is the best estimate, and you probably cannot improve
it further. The whole business of set-versions, which I have been
pondering recently, boils down to these questions: can you improve
the ratio just a little bit? Can you pay a little bit less for a
little bit more? Is there a better data structure? The answer
basically turns out to be "no". The "set of numbers" approach is good
to go, and the Galois field alternatives have some discrepancies
which you do not want to know about, while the benefit is very
marginal.
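One way to read the 2**10/2**20 figure (this is my interpretation,
with the assumptions labeled in the comments): about 2**10 symbol
hashes drawn from a 2**20 space give a false positive probability of
2**-10, and at that target a size-optimal Bloom filter needs
noticeably more bits per element than a Golomb-Rice coded sorted set,
which is roughly the ~20% gap mentioned earlier in the thread:

```python
import math

# My reading of the 2**10/2**20 figure: n = 2**10 stored symbol
# hashes, each b = 20 bits wide.  A random absent symbol then
# collides with a stored hash with probability about n / 2**b.
n, b = 2**10, 20
p = n / 2**b                      # = 2**-10, the quoted ratio

# At the same target p, a size-optimal Bloom filter needs about
# log2(e) * log2(1/p) ~= 1.44 * 10 ~= 14.4 bits per element,
# while a Golomb-Rice coded sorted hash set needs roughly
# log2(1/p) + ~1.5 bits per element.
bloom_bits = math.log2(math.e) * math.log2(1 / p)
gr_bits = math.log2(1 / p) + 1.5
print(p, bloom_bits, gr_bits)
```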

> Below is what is in use by RPM which chooses {m, k} given
> an estimated population n and a desired probability of false
> positives e (I forget where I found the actual computation,
> can likely find it again if/when necessary).
>
> What I would like to be able to do with set:versions
> in RPM is to be able to use either of these 2 "container"
> representations interchangeably.
>
> I'd also like to be able to pre-compute either form
> in package tags for whatever purpose without the
> dreadful tyranny of being "legacy compatible" with
> older versions of RPM. Adding a new tag that isn't
> used by older RPM is exactly as compatible as following
> existing string representations for NEVR dependencies
> in existing tags, arguably more compatible because
> older versions of RPM are incapable of interpreting
> unknown tags.
>
> And again: the best solution for the download bandwidth
> problem is
>        Update the metadata less often.
> if you think a bit.

The best solution to "don't be stupid" is simply not to be stupid. Of
course, downloading metadata is not the only problem, and not exactly
the one we are trying to solve. It all combines into a complex
system. The question then, for me, is: can we do any better? How much
does this whole business cost? If it is still expensive, which it is,
how can we cut down on the costs? These are the kinds of questions I
am willing to discuss with great and unspeakable joy in my heart,
full of glory. The implementation is pretty much optimal, up to a
point.
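For reference, the standard textbook choice of {m, k} from an
estimated population n and a desired false positive probability e,
which the quoted text alludes to, goes like this; this is the generic
formula, not a copy of what rpmbf.c does:

```python
import math

def bloom_params(n, e):
    """Textbook Bloom filter sizing: filter size m in bits and
    number of hash functions k, for n elements and target false
    positive probability e."""
    m = math.ceil(-n * math.log(e) / (math.log(2) ** 2))
    k = max(1, round(m / n * math.log(2)))
    return m, k

# For n = 2**10 elements and e = 2**-10 this gives ~14.4 bits
# per element and about 10 hash functions.
m, k = bloom_params(2**10, 2**-10)
```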
______________________________________________________________________
RPM Package Manager                                    http://rpm5.org
Developer Communication List                        rpm-devel@rpm5.org
