On Sat, Jun 23, 2012 at 7:10 AM, Jeffrey Johnson <n3...@me.com> wrote:
>> Why is it any wrong to minimize bandwidth, or, in other words, why it
>> is bad to spend less money? Your answer is like, because the meaning
>> of life is not to spend less money, which is a wrong perspective.
>> Okay, but what's a better perspective? Spending more money for less
>> good is not at all a better perspective.
>
> Nothing I said implies that bandwidth reduction is "wrong".
>
> If you want to minimize bandwidth downloading packages compress
> the metadata and deal with the "legacy compatible" issues however
> you want/need.
>
> My rule of thumb is that metadata is ~15% of the size of a
> typical *.rpm package: assuming 50% compression one might save
> 7.5% of the package download cost (0.50 * 0.15 = 7.5%).

An interesting practical matter behind set-versions is that we could
simply represent the exact set of symbols, which can be encoded and
pictured like this:

Provides: libfoo.so.0 = set:sym1,sym2,sym3,sym4
Requires: libfoo.so.0 = set:sym1,sym3

(where symN stands for a direct ELF symbol name, e.g. strcmp). You
simply name the symbols which you require or provide! In fact, this is
exactly how the early alpha set-versions were implemented - before I
had some time to ponder the approximate set/subset encoding problem
(that is when the sym1,sym2,... sets were converted into per-symbol
numbers and compressed). The point here is that the price of the exact
set representation may, or may not, be prohibitive. If the price is
not prohibitive, it's a no-brainer: you don't have to get involved in
the approximate-subset business at all. Sometimes you simply should
not use Bloom filters, however appealing they might seem. However, if
the price is prohibitive - and it was - the case for going into the
approximate-subset business is also a no-brainer: you should cut down
heavily and optimize for size first. If you merely introduced
probabilities without making things less prohibitive, did you do
anything useful at all? You only spoiled things a bit!
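
To make the trade-off concrete, here is a minimal Python sketch of the
hashed-set idea (an illustration only - the function names, the hash
choice, and the 20-bit width are my assumptions, not the actual rpm5
code): each symbol is mapped to a short hash, Provides conceptually
stores the sorted hash set, and the Requires check becomes a subset
test with a small false-positive probability.

```python
import hashlib

HASH_BITS = 20  # short per-symbol hash; collisions give rare false positives

def sym_hash(name: str) -> int:
    # Illustrative hash only; real set-versions use their own scheme.
    digest = hashlib.blake2b(name.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") & ((1 << HASH_BITS) - 1)

def encode_set(symbols):
    # A "set-version" is conceptually just the sorted set of symbol hashes.
    return sorted({sym_hash(s) for s in symbols})

def is_subset(required, provided):
    # Requires is satisfied when every required hash appears in Provides.
    return set(required) <= set(provided)

provides = encode_set(["sym1", "sym2", "sym3", "sym4"])
requires = encode_set(["sym1", "sym3"])
print(is_subset(requires, provides))  # True
```

Note that shrinking symbol names to 20-bit hashes is already a lossy,
"reduced" description: two different symbols can collide, which is
exactly where the ~0.1% false-positive rate discussed below comes from.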

There is also a philosophical consideration which accompanies this
practical one. There is a short story, I believe by Borges, in which
clever cartographers devise a 1:1 map of reality. A 1:1 map of reality
turns out to be a perfectly true picture of reality. So what's wrong
with this approach? Well, a 1:1 map of the world turns out to be an
exact copy of the world, which is of no use at all as "a map". Reality
must somehow be "construed" and "reduced" to a simpler (and somewhat
coarser) description to become a useful model. This is also why we
don't plug whole ELF sections into RPM headers: we believe that much
simpler (or at least much shorter) dependencies must be used to
represent ELF binaries in terms of their interconnectedness, and that
other, less important details must be omitted.

Back to the story of set-versions: with the "original" implementation
(which introduced full Golomb coding), it was estimated that the size
of the architecture-dependent pkglist.classic.bz2 metadata was going
to grow from about 3M to about 12M - four-fold! Soaring up like this
was still almost prohibitive. It was deemed non-prohibitive only by
virtue of information-theoretic considerations: since we are going to
encode that many symbols, we must not fool ourselves into thinking
that we could somehow pay a smaller price - not, that is, without
violating fundamental laws.
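
Golomb coding pays off here because, once the hash values are sorted,
the gaps between consecutive values are approximately geometrically
distributed - exactly the distribution Golomb/Rice codes compress
well. A rough Python sketch (the Rice variant, with made-up
parameters; not rpm5's actual encoder) shows the per-symbol cost
landing near the theoretical bound discussed below:

```python
import random

def rice_encode_len(gaps, k):
    # Rice code: quotient (g >> k) in unary, plus one stop bit, plus
    # a k-bit remainder; return the total encoded length in bits.
    return sum((g >> k) + 1 + k for g in gaps)

random.seed(0)
n, hash_bits = 1024, 20
values = sorted(random.sample(range(1 << hash_bits), n))
# Delta-encode: sorted values become small, roughly geometric gaps.
gaps = [values[0]] + [b - a for a, b in zip(values, values[1:])]
# The mean gap is ~2**hash_bits / n = 1024, so k near 10 works well.
best_k = min(range(4, 16), key=lambda k: rice_encode_len(gaps, k))
bits = rice_encode_len(gaps, best_k)
print(best_k, bits / n)  # roughly 11.5 bits per hash
```

The delta-plus-Rice trick is what turns 1024 raw 20-bit values (20
bits each) into roughly 11.5 bits each, within a fraction of a bit of
the information-theoretic minimum.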

By the way, what is the information-theoretic minimum? Say we want to
encode 1024 20-bit hash values (which yields a false positive rate of
about 0.1%). Well, the first mathematical intuition is that we need to
cut the 20-bit range into 1024 smaller stripes, which gives 10 bits
per stripe, on average. It is a little more complicated than that,
though, exactly because of this "on average" business: we must also
spend some bits to encode the stripe boundaries. But this is only a
mathematical intuition. The exact formula, in R, is:

> lchoose(2**20,2**10)/log(2)/2**10
[1] 11.43581

(so, on the other hand, and somewhat unexpectedly, it is that good old
"n choose k" business.)  The current implementation lines up at about
11.6 bits per symbol. This is why I sometimes say that set-versions
currently take about 2 alnum characters per Provides symbol - each
alnum character can stuff about log2(62)=5.9+ bits.
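
The same figure can be reproduced without R: log(C(2^20, 2^10)) via
lgamma (the standard way to compute log-binomials without overflow)
gives the identical ~11.44 bits per hash.

```python
from math import lgamma, log

def lchoose(n, k):
    # log of the binomial coefficient C(n, k), as in R's lchoose().
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

# Minimum bits to say which 2**10 of the 2**20 hash values are present,
# spread over the 2**10 values actually encoded.
bits_per_symbol = lchoose(2**20, 2**10) / log(2) / 2**10
print(round(bits_per_symbol, 5))  # 11.43581
```

So the implementation's ~11.6 bits per symbol sits within about 0.2
bits of this lower bound.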

But compare this to the "early alpha" set-versions which represented
exact sets in the form "set:sym1,sym2,sym3,sym4", that is, by
enumerating symbols. With exact sets, can you think of getting
anywhere near 2 characters per symbol, especially since you also need
a separator? That leaves you 1 character per symbol! :-)

There's a saying that Perl makes easy things easy and hard things
possible. Set-versions were designed to make hard things possible.
______________________________________________________________________
RPM Package Manager                                    http://rpm5.org
Developer Communication List                        rpm-devel@rpm5.org
