On Sat, Jun 23, 2012 at 7:10 AM, Jeffrey Johnson <n3...@me.com> wrote:
>> Why is it any wrong to minimize bandwidth, or, in other words, why it
>> is bad to spend less money? Your answer is like, because the meaning
>> of life is not to spend less money, which is a wrong perspective.
>> Okay, but what's a better perspective? Spending more money for less
>> good is not at all a better perspective.
>
> Nothing I said implies that bandwidth reduction is "wrong".
>
> If you want to minimize bandwidth downloading packages compress
> the metadata and deal with the "legacy compatible" issues however
> you want/need.
>
> My rule of thumb is that metadata is ~15% of the size of a
> typical *.rpm package: assuming 50% compression one might save
> 7.5% of the package download cost (0.50 * 0.15 = 7.5%).
An interesting practical matter behind set-versions is that we could simply represent the exact set of symbols, which can be encoded and pictured like this:

    Provides: libfoo.so.0 = set:sym1,sym2,sym3,sym4
    Requires: libfoo.so.0 = set:sym1,sym3

(where symN stands for a direct ELF symbol name, e.g. strcmp). You simply name the symbols which you require or provide! In fact, this is exactly how early alpha set-versions were implemented - before I had some time to ponder the approximate set/subset encoding problem (which is how sym1,sym2,... sets came to be converted into numbers, one per symbol, and compressed).

The point here is that the price of the exact set representation may, or may not, be prohibitive. If the price is not prohibitive, it's a no-brainer: you don't have to get involved in the approximate-subset business at all. Sometimes you simply should not use Bloom filters, despite the fact that they might seem appealing. However, if the price is prohibitive - which it was - the reason for going into the approximate-subset business is also a no-brainer: you should cut down heavily and optimize for size first. If you merely introduced probabilities without making things less prohibitive, did you do anything useful at all? You only spoiled things a bit!

There is also a philosophical consideration which accompanies this practical one. There is a short story, I believe by Borges, where a clever scientist devises a 1:1 map of reality. A 1:1 map of reality turns out to be a very true picture of reality. What's wrong with this approach? Well, a 1:1 map of the world turns out to be an exact copy of the world, which is of no use in terms of being "a map". Somehow, reality must be "construed" and "reduced" to a simpler (and somewhat coarser) description to become a useful model.
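To make the contrast concrete, here is a minimal Python sketch of the approximate alternative. It is not the actual set-versions code - the hash function (SHA-1) and the 20-bit width are assumptions for illustration only - but it shows the essential move: each symbol name is hashed down to a small fixed-width value, and the subset test runs on the hashed sets. False negatives cannot happen; false positives can, at a rate governed by the hash width.

```python
import hashlib

def set_version_hashes(symbols, bits=20):
    """Hash each symbol name down to a fixed-width value (a hypothetical
    stand-in for the real hash; the actual implementation differs)."""
    mask = (1 << bits) - 1
    return sorted(
        int.from_bytes(hashlib.sha1(s.encode()).digest()[:4], "big") & mask
        for s in symbols
    )

provides = set_version_hashes(["sym1", "sym2", "sym3", "sym4"])
requires = set_version_hashes(["sym1", "sym3"])

# Subset check on the hashed sets: since the same deterministic hash is
# applied on both sides, a true subset always tests as a subset; a
# non-subset can slip through only via a hash collision.
assert set(requires) <= set(provides)
```

The point of the sketch is the trade: instead of storing symbol names of arbitrary length, each symbol costs a bounded number of bits, at the price of a small, quantifiable false-positive probability.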
This is also why we don't plug ELF sections into RPM headers: we believe that much simpler (or at least much shorter) dependencies must be used to represent ELF binaries in terms of their interconnections, and that other, less important details must be omitted.

Back to the story of set-versions: with the "original" implementation (which introduced full Golomb coding), it was estimated that the size of the architecture-dependent pkglist.classic.bz2 metadata was going to go up from about 3M to about 12M - four-fold! This was still almost prohibitive. It was considered non-prohibitive only by virtue of information-theoretic considerations: since we are going to encode that many symbols, we must not fool ourselves into thinking that we could somehow pay a smaller price - that is, without violating fundamental laws.

By the way, what's the information-theoretic minimum? Say we want to encode 1024 20-bit hash values (which yields a false positive rate of about 0.1%). The first mathematical intuition is that we need to cut the 20-bit range into 1024 smaller stripes, which gives 10 bits per stripe, on average. It is a little more complicated than that, though, exactly because of this "on average" business: we must also spend some bits to encode the stripe boundaries. But this is only an intuition. The exact formula, in R, is:

    > lchoose(2**20,2**10)/log(2)/2**10
    [1] 11.43581

(so, on the other hand, and somewhat unexpectedly, it is that good old "n choose k" business.) The current implementation lines up at about 11.6 bits per symbol. This is why I sometimes say that set-versions currently take about 2 alnum characters per Provides symbol - each alnum character can carry about log2(62)=5.9+ bits. But compare this to "early alpha" set-versions, which represented exact sets in the form "set:sym1,sym2,sym3,sym4", that is, by enumerating symbols.
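Both numbers above can be reproduced with a short Python sketch: the information-theoretic minimum via the same "n choose k" formula as the R one-liner, and a figure close to the implementation's ~11.6 bits/symbol via Rice coding of the deltas of 1024 sorted random 20-bit values. The Rice parameter k=10 and the random values are assumptions for illustration - a simplified stand-in for the Golomb coding actually used, not the real encoder.

```python
import math
import random

# Information-theoretic minimum: log2 C(2^20, 2^10) / 2^10 bits per value,
# computed via log-gamma, just like R's lchoose().
n, k = 2**20, 2**10
min_bits = (math.lgamma(n + 1) - math.lgamma(k + 1)
            - math.lgamma(n - k + 1)) / math.log(2) / k
print(round(min_bits, 5))  # 11.43581, matching the R one-liner

def rice_size(values, kbits):
    """Total bits to Rice-code sorted values as successive deltas:
    per delta, the quotient d >> kbits in unary (q ones plus a
    terminating zero) followed by a kbits-wide remainder."""
    bits, prev = 0, 0
    for v in values:
        d, prev = v - prev, v
        bits += (d >> kbits) + 1 + kbits
    return bits

random.seed(7)
hashes = sorted(random.sample(range(n), k))
print(round(rice_size(hashes, 10) / k, 2))  # close to the quoted ~11.6
```

With mean delta near 2^10, the quotient is almost always 0 or 1, so the cost per value sits a little above the 11-bit floor (10 remainder bits plus the unary terminator) - which is why the practical encoding lands between the 11.44-bit theoretical minimum and 12 bits.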
With exact sets, can you think of getting anywhere near 2 characters per symbol, especially given that you also need a separator? That leaves you 1 character per symbol! :-)

There's a saying that Perl makes easy things easy and hard things possible. Set-versions were designed to make hard things possible.

______________________________________________________________________
RPM Package Manager                           http://rpm5.org
Developer Communication List                  rpm-devel@rpm5.org