Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version

2012-07-03 Thread Jeffrey Johnson

On Jul 2, 2012, at 11:21 PM, Alexey Tourbin wrote:

 On Mon, Jul 2, 2012 at 9:17 AM, Jeffrey Johnson n3...@me.com wrote:
 All RPMv4.4+ packages, that is, but not RPMv4.0. I find this file
 coloring business very annoying, by the way, and it took me some time
 to realize that fc actually stands for file coloring. :-)
 
 RPMv4.0 was a l-o-n-g time ago.
 
 It was a long time ago, but it was not that bad. In ALT Linux we still
 use rpm-4.0 code base (with many important backports etc.). Since I've
 recently broken with ALT Linux, I cannot make more claims, you see…
 

In all cases: whether rpm-4.0.x had file classification is irrelevant to
whether a set:versions container can be used for keyword storage.

 I find multilib quite annoying: I was asked for an implementation,
 and did so. 'Twas a job mon: already 7y since leaving @redhat, and file
 colors for multi lib were several years before that.
 
 I don't like how multilib works either. You can no longer identify
 packages by their names, and then there are special rules to resolve
 file conflicts, which basically say that the license is that you can
 swamp files as much as you want, provided that you got the first hand
 in that strange x86-64 relationship. There is than that recent x32
 stuff where you can run in 64-bit mode using only 32-bit pointers. How
 can you address THAT? To me, the world is declining, like the Roman
 Empire. ;-)
 

Dependency graphs based solely on package names are overrated, particularly
when there are multiple naming schemes around, and packages are continuously
renamed, and so the benefits of consistent naming are already a pleasant lie.

With file colors, the test for x32 elf magic sets a file color bit and nothing
else in rpm needs changing.

The existing preference configuration to resolve executable (any non-elf 
executable
file conflict is a different issue) automates the resolution in 2 locations. 
Might
need a tweak to handle secondary chjoices (i.e. when neither of the conflicting
files is the preferred arch) … *shrug*


 I see no reason why rpm(1) should ever consider any preferences. To
 me, rpm(1) is exactly black-and-white thing, a watchdog which checks
 logical assertions. Things are either consistent enough to proceed
 (and e.g. to upgrade the packages), or not - and then you get e.g.
 non-zero exit code and you are forced to bail out. Higher-level logic
 of finding an upgrade plan anyways belongs to something like apt(1) or
 yum(1), although it is executed in some basic librpm terms which we
 must make efficient enough. I see no reason to discuss closest metrics
 or largest overlaps - this is as interesting as irrelevant to our
 basic tasks.
 
 Everyone has an opinion: yours is particularly brittle but
 entirely logical. Now go persuade everyone else that your
 opinion is The One True Opinion and RPM will surely change too.
 
 There's a similar application with dual/triple/... licensed software and
 computing
 per-file, not per-package, license affinity precisely where set:versions 
 (or
 Bloom
 filters) will represent keywords (like LGPLv2 or BSD) easily. Licenses
 unlike
 file(1) magic keywords will require name space administration. SUrely LSB
 and LFF
 are seeking something useful to do for RPM packaging these days, and might
 be convinced to make some set of license tokens standard so that license
 affinity can be precisely computed in distributed software.
 
 You then discuss more applications which are largely irrelevant to our
 basic tasks. (I realize that I'm revisiting and older discussion,
 which might not be completely fair because our understanding might
 have evolved since.)
 
 Um … what are … our basic tasks.?
 
 Our basic task, is that when we feed packages to rpm(1), it must
 quickly decide whether the upgrade is feasible or not. Of course this
 involves hard and sometimes speculative considerations whether things
 are going to work. But if things are definitely not going to work, rpm
 should upgrade never! Not without a special flag which reads
 '--upgrade-as-i-wish'.
 

We disagree on several points here:
1) upgrade decisions based solely on version information do not handle 
rebuilds.
2) FULLSTOP never! behavior when something breaks makes no forward
progress when there's some minor issue. A best effort is generally a 
better approach
even in the face of occasionally making matters worse.
3) adding Yet Another Disabler that noone knows how to use solves no 
meaningful problem.

 What is not part of our basic tasks is to find an upgrade plan. How
 they do that - it's another business, and completely another story. We
 only check if the upgrade is consistent or not.
 

Assuming transitive closure on dependency assertions isn't all that is needed.

 I cannot determine what is fair without knowing what is being compared …
 
 Anyway, set-versions are not the next big thing with plenty of
 applications. It's rather a very boring stuff which nevertheless
 

Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version

2012-07-02 Thread Alexey Tourbin
On Mon, Jul 2, 2012 at 9:17 AM, Jeffrey Johnson n3...@me.com wrote:
 All RPMv4.4+ packages, that is, but not RPMv4.0. I find this file
 coloring business very annoying, by the way, and it took me some time
 to realize that fc actually stands for file coloring. :-)

 RPMv4.0 was a l-o-n-g time ago.

It was a long time ago, but it was not that bad. In ALT Linux we still
use rpm-4.0 code base (with many important backports etc.). Since I've
recently broken with ALT Linux, I cannot make more claims, you see...

 I find multilib quite annoying: I was asked for an implementation,
 and did so. 'Twas a job mon: already 7y since leaving @redhat, and file
 colors for multi lib were several years before that.

I don't like how multilib works either. You can no longer identify
packages by their names, and then there are special rules to resolve
file conflicts, which basically say that the license is that you can
swamp files as much as you want, provided that you got the first hand
in that strange x86-64 relationship. There is than that recent x32
stuff where you can run in 64-bit mode using only 32-bit pointers. How
can you address THAT? To me, the world is declining, like the Roman
Empire. ;-)

 I see no reason why rpm(1) should ever consider any preferences. To
 me, rpm(1) is exactly black-and-white thing, a watchdog which checks
 logical assertions. Things are either consistent enough to proceed
 (and e.g. to upgrade the packages), or not - and then you get e.g.
 non-zero exit code and you are forced to bail out. Higher-level logic
 of finding an upgrade plan anyways belongs to something like apt(1) or
 yum(1), although it is executed in some basic librpm terms which we
 must make efficient enough. I see no reason to discuss closest metrics
 or largest overlaps - this is as interesting as irrelevant to our
 basic tasks.

 Everyone has an opinion: yours is particularly brittle but
 entirely logical. Now go persuade everyone else that your
 opinion is The One True Opinion and RPM will surely change too.

 There's a similar application with dual/triple/... licensed software and
 computing
 per-file, not per-package, license affinity precisely where set:versions (or
 Bloom
 filters) will represent keywords (like LGPLv2 or BSD) easily. Licenses
 unlike
 file(1) magic keywords will require name space administration. SUrely LSB
 and LFF
 are seeking something useful to do for RPM packaging these days, and might
 be convinced to make some set of license tokens standard so that license
 affinity can be precisely computed in distributed software.

 You then discuss more applications which are largely irrelevant to our
 basic tasks. (I realize that I'm revisiting and older discussion,
 which might not be completely fair because our understanding might
 have evolved since.)

 Um … what are … our basic tasks.?

Our basic task, is that when we feed packages to rpm(1), it must
quickly decide whether the upgrade is feasible or not. Of course this
involves hard and sometimes speculative considerations whether things
are going to work. But if things are definitely not going to work, rpm
should upgrade never! Not without a special flag which reads
'--upgrade-as-i-wish'.

What is not part of our basic tasks is to find an upgrade plan. How
they do that - it's another business, and completely another story. We
only check if the upgrade is consistent or not.

 I cannot determine what is fair without knowing what is being compared …

 Anyway, set-versions are not the next big thing with plenty of
 applications. It's rather a very boring stuff which nevertheless
 answers the question how we can possibly enhance ABI compatibility
 control beyond sonames. The answer is that we must involve into
 set/subset testing - that's the model, that it is very expensive, and
 that the only reasonable and possibly the best way to go is to replace
 symbols with numbers, and to treat sets of numbers as special kind of
 versions. Now why is that? But that's a much better perspective for
 discussion.

 We differ in usage cases. I see a container, you see a version.

I realize that the version is somehow a contrived concept behind
really a containter. Like: - Whats' that? - It's a version. - Oh my
gosh!  But then again, if you try to organize your thinking beyond
sets, bytes, and characters, the concept of version pops up pretty
naturally. If we are going to satisfy some requirement, that's good
for version.

 There are many usage cases for subset operations in package management
 no matter what we think or how the operations are implemented and represented.

Not all subset operations are equally important, or complex. As I
said, sometimes you should make easy things easy, and simply use
std::set which is Rb-tree or something like this. Sometimes things
go off, though. Like this: - How many symbols do you want to encode? -
10M. - That's a lot, are you ready to pay? - No. - That's not serious,
are you ready to pay at least 2 characters per symbol? - Well,

Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version

2012-07-01 Thread Alexey Tourbin
On Sat, Jun 23, 2012 at 10:29 PM, Jeffrey Johnson n3...@me.com wrote:
 In the interest of getting off negative nerdy obscure discussions, let's
 try a positive alternative application for Golob-Rice subset operations.

 All RPMv4 packages attach (a lightly filtered) file(1) magic string to
 every file.

All RPMv4.4+ packages, that is, but not RPMv4.0. I find this file
coloring business very annoying, by the way, and it took me some time
to realize that fc actually stands for file coloring. :-)

 The file(1) data is mostly usable as a keyword namespace exactly as is.
 Yes there are flaws: however magic strings are from file(1) is about
 as good as any other de facto keyword tagging of file content.

 keywords are strings just like elf symbols are, and set:versions (or Bloom
 filters)
 are a compact representation from which its rather easy to do subset
 computations.

 One extension that would be needed is a closest metric in order to
 prefer
 the largest subset overlap: with set:versions any contained subset will
 satisfy the
 logical assertions, and there's no easy way to prefer the larger sub-set.

I see no reason why rpm(1) should ever consider any preferences. To
me, rpm(1) is exactly black-and-white thing, a watchdog which checks
logical assertions. Things are either consistent enough to proceed
(and e.g. to upgrade the packages), or not - and then you get e.g.
non-zero exit code and you are forced to bail out. Higher-level logic
of finding an upgrade plan anyways belongs to something like apt(1) or
yum(1), although it is executed in some basic librpm terms which we
must make efficient enough. I see no reason to discuss closest metrics
or largest overlaps - this is as interesting as irrelevant to our
basic tasks.

 There's a similar application with dual/triple/... licensed software and
 computing
 per-file, not per-package, license affinity precisely where set:versions (or
 Bloom
 filters) will represent keywords (like LGPLv2 or BSD) easily. Licenses
 unlike
 file(1) magic keywords will require name space administration. SUrely LSB
 and LFF
 are seeking something useful to do for RPM packaging these days, and might
 be convinced to make some set of license tokens standard so that license
 affinity can be precisely computed in distributed software.

You then discuss more applications which are largely irrelevant to our
basic tasks. (I realize that I'm revisiting and older discussion,
which might not be completely fair because our understanding might
have evolved since.)

Anyway, set-versions are not the next big thing with plenty of
applications. It's rather a very boring stuff which nevertheless
answers the question how we can possibly enhance ABI compatibility
control beyond sonames. The answer is that we must involve into
set/subset testing - that's the model, that it is very expensive, and
that the only reasonable and possibly the best way to go is to replace
symbols with numbers, and to treat sets of numbers as special kind of
versions. Now why is that? But that's a much better perspective for
discussion.
__
RPM Package Managerhttp://rpm5.org
Developer Communication Listrpm-devel@rpm5.org


Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version

2012-07-01 Thread Jeffrey Johnson

On Jul 1, 2012, at 11:43 PM, Alexey Tourbin alexey.tour...@gmail.com wrote:

 On Sat, Jun 23, 2012 at 10:29 PM, Jeffrey Johnson n3...@me.com wrote:
 In the interest of getting off negative nerdy obscure discussions, let's
 try a positive alternative application for Golob-Rice subset operations.
 
 All RPMv4 packages attach (a lightly filtered) file(1) magic string to
 every file.
 
 All RPMv4.4+ packages, that is, but not RPMv4.0. I find this file
 coloring business very annoying, by the way, and it took me some time
 to realize that fc actually stands for file coloring. :-)
 

RPMv4.0 was a l-o-n-g time ago.

I find multilib quite annoying: I was asked for an implementation,
and did so. 'Twas a job mon: already 7y since leaving @redhat, and file
colors for multi lib were several years before that.

And the fc stands for file classifier. But feel free
to claim whatever you want just like everyone else.


 The file(1) data is mostly usable as a keyword namespace exactly as is.
 Yes there are flaws: however magic strings are from file(1) is about
 as good as any other de facto keyword tagging of file content.
 
 keywords are strings just like elf symbols are, and set:versions (or Bloom
 filters)
 are a compact representation from which its rather easy to do subset
 computations.
 
 One extension that would be needed is a closest metric in order to
 prefer
 the largest subset overlap: with set:versions any contained subset will
 satisfy the
 logical assertions, and there's no easy way to prefer the larger sub-set.
 
 I see no reason why rpm(1) should ever consider any preferences. To
 me, rpm(1) is exactly black-and-white thing, a watchdog which checks
 logical assertions. Things are either consistent enough to proceed
 (and e.g. to upgrade the packages), or not - and then you get e.g.
 non-zero exit code and you are forced to bail out. Higher-level logic
 of finding an upgrade plan anyways belongs to something like apt(1) or
 yum(1), although it is executed in some basic librpm terms which we
 must make efficient enough. I see no reason to discuss closest metrics
 or largest overlaps - this is as interesting as irrelevant to our
 basic tasks.

Everyone has an opinion: yours is particularly brittle but
entirely logical. Now go persuade everyone else that your
opinion is The One True Opinion and RPM will surely change too.

 
 There's a similar application with dual/triple/... licensed software and
 computing
 per-file, not per-package, license affinity precisely where set:versions (or
 Bloom
 filters) will represent keywords (like LGPLv2 or BSD) easily. Licenses
 unlike
 file(1) magic keywords will require name space administration. SUrely LSB
 and LFF
 are seeking something useful to do for RPM packaging these days, and might
 be convinced to make some set of license tokens standard so that license
 affinity can be precisely computed in distributed software.
 
 You then discuss more applications which are largely irrelevant to our
 basic tasks. (I realize that I'm revisiting and older discussion,
 which might not be completely fair because our understanding might
 have evolved since.)
 

Um … what are … our basic tasks.?

I cannot determine what is fair without knowing what is being compared …

 Anyway, set-versions are not the next big thing with plenty of
 applications. It's rather a very boring stuff which nevertheless
 answers the question how we can possibly enhance ABI compatibility
 control beyond sonames. The answer is that we must involve into
 set/subset testing - that's the model, that it is very expensive, and
 that the only reasonable and possibly the best way to go is to replace
 symbols with numbers, and to treat sets of numbers as special kind of
 versions. Now why is that? But that's a much better perspective for
 discussion.

We differ in usage cases. I see a container, you see a version.

There are many usage cases for subset operations in package management
no matter what we think or how the operations are implemented and represented.

73 de Jeff

 __
 RPM Package Managerhttp://rpm5.org
 Developer Communication Listrpm-devel@rpm5.org

__
RPM Package Managerhttp://rpm5.org
Developer Communication Listrpm-devel@rpm5.org


Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version

2012-06-28 Thread Alexey Tourbin
On Sat, Jun 23, 2012 at 7:10 AM, Jeffrey Johnson n3...@me.com wrote:
 Why is it any wrong to minimize bandwidth, or, in other words, why it
 is bad to spend less money? Your answer is like, because the meaning
 of life is not to spend less money, which is a wrong perspective.
 Okay, but what's a better perspective? Spending more money for less
 good is not at all a better perspective.

 Nothing I said implies that bandwidth reduction is wrong.

 If you want to minimize bandwidth downloading packages compress
 the metadata and deal with the legacy compatible issues however
 you want/need.

 My rule of thumb is that metadata is ~15% of the size of a
 typical *.rpm package: assuming 50% compression one might save
 7.5% of the package download cost (0.50 * 0.15 = 7.5%).

An interesting practical matter behind set-versions is that we can
simply represent the exact set of symbols, which can be encoded and
pictured like this:

Provides: libfoo.so.0 = set:sym1,sym2,sym3,sym4
Requires: libfoo.so.0 = set:sym1,sym3
(where symN stands for direct ELF symbol name - e.g. strcmp). You
simply name the symbols which you require or provide! In fact, this is
exactly how early alpha set-versions were implemented - before I had
some time to ponder over approximate set/subset encoding problem (it
is then how sym1,sym2... sets where converted into numbers, per
symbol, and compressed). The point here is that the price of the exact
set representation may, or may not, be prohibitive. If the price is
not prohibitive, it's a no-brainer: you don't have to involve into
approximate subset business at all. Sometimes, you simply should not
use bloom filters, despite the fact that they might seem appealing.
However, if the price is prohibitive, which it was, the reason for
going into approximate subset business is also a no-brainer: you
should cut down heavily and optimize for size first. If you simply
introduced probabilities without making things less prohibitive, did
you do anything useful at all? You only spoiled things a bit!

There is also a philosophical consideration which somehow accompanies
this practical consideration. There is a short story, I believe by
Borges, where a clever scientist devises a 1-1 map of reality. A 1-1
map of reality turns out to be a very true picture of reality. What's
wrong with this approach? Well, a 1-1 map of the world turns out to be
an exact copy of the world, which is of no use in terms of being a
map. Somehow, the reality must be construed and reduced to a
simpler (and somewhat coarser) description to become a useful model.
This is also why we don't plug ELF sections into RPM headers: we
believe that much simpler (or at least much shorter) dependencies must
be used to represent ELF binaries in terms of their
interconnectedness, and must also omit other less important details.

Back to the story of set-versions, with the original implementation
(which introduced full Golomb coding), it was estimated that the size
of architecture-dependent pkglist.classic.bz2 metadata is going to go
up from about 3M to about 12M, four-fold! This still was almost
prohibitive to soar up like this. It was considered non-prohibitive
only by the virtue of information-theoretical considerations: since we
are going to encode that many symbols, we must not fool ourselves into
thinking that we could somehow pay a smaller price - that is, without
violating fundamental laws.

By the way, what's the information-theoretical minimum? Say, we want
to encode 1024 20-bit hash values (which yields the false positive
rate at about 0.1%). Well, the first mathematical intuition is that we
need to cut 20-bit range into 1024 smaller stripes, which gives 10
bits per stripe, on average. It is a little bit more complicated than
that, though, exactly because of this on average business: we must
also take some bits to encode stripe boundaries. But this is only a
mathematical intuition. The exact formula, in R, is:

 lchoose(2**20,2**10)/log(2)/2**10
[1] 11.43581

(so, on the other hand, and somewhat unexpectedly, it is that old good
n choose k business.)  Current implementation lines up at about 11.6
bits per symbol. This is why sometimes I say that set-versions
currently take about 2 alnum character per Provides symbol - this is
because each alnum character can stuff about log2(62)=5.9+ bits.

But compare this to early alpha set-versions which represented exact
sets in form of set:sym1,sym2,sym3,sym4, that is, in terms of
enumerating symbols. With exact sets, can you think of going to
anywhere near 2 characters per symbol, especially that you also need a
separator? This leaves you 1 character per symbol! :-)

There's a saying that Perl makes easy things easy and hard things
possible. Set-versions were designed to make hard things possible.
__
RPM Package Managerhttp://rpm5.org
Developer Communication List

Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version

2012-06-28 Thread Jeffrey Johnson

On Jun 28, 2012, at 12:18 PM, Alexey Tourbin wrote:

 
 There is also a philosophical consideration which somehow accompanies
 this practical consideration. There is a short story, I believe by
 Borges, where a clever scientist devises a 1-1 map of reality. A 1-1
 map of reality turns out to be a very true picture of reality. What's
 wrong with this approach? Well, a 1-1 map of the world turns out to be
 an exact copy of the world, which is of no use in terms of being a
 map. Somehow, the reality must be construed and reduced to a
 simpler (and somewhat coarser) description to become a useful model.
 This is also why we don't plug ELF sections into RPM headers: we
 believe that much simpler (or at least much shorter) dependencies must
 be used to represent ELF binaries in terms of their
 interconnectedness, and must also omit other less important details.
 

(aside)
Nice: I'll dig out the Borges story … meanwhile (iirc) there is a story (and 
surely
some sort of literary response to Borges) called
How Much Shall We Bet
in Cosmicomics that I pass along for your amusement.

Seriously yes: carrying redundant metadata _EXACTLY_ as is
in the interest of accurate data -- while well intended -- isn't
needed.

And both Bloom filters and set:versions are useful/compact data
stores for existence tests without a need to also carry the _EXACT_
values (if/when the false positive rate can be tolerated, which
is often true for the moderate sized sets needed in package management).

The rest I agree with.

73 de Jeff__
RPM Package Managerhttp://rpm5.org
Developer Communication Listrpm-devel@rpm5.org


Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version

2012-06-28 Thread Alexey Tourbin
On Thu, Jun 28, 2012 at 8:18 PM, Alexey Tourbin
alexey.tour...@gmail.com wrote:
 There is also a philosophical consideration which somehow accompanies
 this practical consideration. There is a short story, I believe by
 Borges, where a clever scientist devises a 1-1 map of reality. A 1-1
 map of reality turns out to be a very true picture of reality. What's
 wrong with this approach? Well, a 1-1 map of the world turns out to be
 an exact copy of the world, which is of no use in terms of being a
 map. Somehow, the reality must be construed and reduced to a
 simpler (and somewhat coarser) description to become a useful model.
 This is also why we don't plug ELF sections into RPM headers: we
 believe that much simpler (or at least much shorter) dependencies must
 be used to represent ELF binaries in terms of their
 interconnectedness, and must also omit other less important details.

This philosophical argument applies to set-versions in a
(not-so-)obvious manner, which I will now clarify. It goes like this:
although the ultimate goal is to check that R-set of symbols is a
subset of P-set of symbols, you do not necessarily have to store the
full names of the symbols in order to perform a somewhat stripped-down
check itself. When it only matters if R is subset of P, the names
themselves become largely irrelevant, provided that you can devise a
very clever substitution/encoding scheme. You can make much simpler
(or at least shorter) dependencies by getting rid of the names in a
manner which does not destroy the check.

The downside is, of course, that when a dependency R subset P is
broken, it is not easy to find out which P symbols were deleted or
renamed (or which R symbols are missing).  But this is largely a
developer's, or should I say a hacker's, problem.

On the other hand, from the user point of view, and also from rpm(1)
perspective, this approach simply promotes synchronous or rather
transactional upgrades. It says like, guys, I will not apply
half-baked updates before you fix it all - so that apps and libraries
match. Which totally makes sense!
__
RPM Package Managerhttp://rpm5.org
Developer Communication Listrpm-devel@rpm5.org


Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version

2012-06-28 Thread Jeffrey Johnson

On Jun 28, 2012, at 1:27 PM, Alexey Tourbin wrote:

 
 The downside is, of course, that when a dependency R subset P is
 broken, it is not easy to find out which P symbols were deleted or
 renamed (or which R symbols are missing).  But this is largely a
 developer's, or should I say a hacker's, problem.
 

Calculating set union/disjunction (for Bloom filters, surely almost as
easy with set:versions) for P/R sets
union   P | R
intersectionP  R
is natural/rapid/easy even if the symbols are unknown.

FWIW: the arbitrary tags implemented @rpm5.org have a similar
one-way lossy mapping where the string representation of the
tag name cannot be unmapped from the 30 bit (truncated SHA1) tag number.

The lack of mapping invertibility matters little to batch oriented installer 
usage cases
in RPM libraries even if enquiring minds wish concise/complete/informative 
error
messages for diagnosis of problems.

Error messages (and symbols) are overrated and less useful with moderately
sized token dictionaries imho.

I hope to get around to generating set:versions this weekend. I was almost there
10 days ago, but got side-tracked by 1 week's worth of cable modem issues which
has now led to postponed backup/upgrade chores … almost finished.

73 de Jeff__
RPM Package Managerhttp://rpm5.org
Developer Communication Listrpm-devel@rpm5.org


Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version

2012-06-23 Thread Alexey Tourbin
On Sat, Jun 23, 2012 at 1:55 AM, Jeffrey Johnson n3...@me.com wrote:
 There are lots of usage cases for efficient sub-set computations
 in package management, not just as a de facto API/ABI check
 using ELF symbols. Most of the other usage cases for efficient
 sub-set computations are not subject to an ultimately optimal
 string encoding.

The possibility of optimal string encoding should not be
underestimated. If you can't write it down with a pencil, which by
the way refers to Alan Turing's style of reasoning, it becomes very
problematic anyway. Your intuition is probably that raw bytes are
cheap because you can bitwise-AND them in terms of direct CPU
instructions, and any encoding is expensive because you have to
crunch bits a lot. This is not very true in practice, though. But you
should arm yourself with valgrind(1) and spend a few days with it
before you understand how you waste you CPU powers (and bandwidth, for
that matter). Set-string encoding has reasonable, and affordable,
cost. It can be all put together and presented in a rather less
pessimistic manner. (For example, there is a cache_decode_set routine
that boosts things by a factor of about o 5. This is simply due to the
fact that you do not always have to decode the same Provides version
all over again. This is part of that fancy-schmancy stuff which you
must ignore on the first reading.)

So I suggest that set-strings must cover all usages of set-subset
computations, where static data structures are appropriate (so that
you compute it, write into RPM header, and do not expect to modify
it). it is still efficient, despite the fact that we try hard to
achieve a better string representation. The fact that it can be
packaged nicely does not make it any worse in other respects, except
that obviously we can stuff somewhat less bits into alnum characters.
But I also wanted it to look nice (this is why base62 was devised
over much simpler base64 encoding). But so, it looks nice, and it is
efficient. Why are you unhappy? Do you necessarily require that you
have to do bitmask tests yourself?
__
RPM Package Managerhttp://rpm5.org
Developer Communication Listrpm-devel@rpm5.org


Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version

2012-06-23 Thread Jeffrey Johnson

On Jun 23, 2012, at 1:49 PM, Alexey Tourbin alexey.tour...@gmail.com wrote:

 On Sat, Jun 23, 2012 at 1:55 AM, Jeffrey Johnson n3...@me.com wrote:
 There are lots of usage cases for efficient sub-set computations
 in package management, not just as a de facto API/ABI check
 using ELF symbols. Most of the other usage cases for efficient
 sub-set computations are not subject to an ultimately optimal
 string encoding.
 
 The possibility of optimal string encoding should not be
 underestimated. If you can't write it down with a pencil, which by
 the way refers to Alan Turing's style of reasoning, it becomes very
 problematic anyway. Your intuition is probably that raw bytes are
 cheap because you can bitwise-AND them in terms of direct CPU
 instructions, and any encoding is expensive because you have to
 crunch bits a lot. This is not very true in practice, though. But you
 should arm yourself with valgrind(1) and spend a few days with it
 before you understand how you waste you CPU powers (and bandwidth, for
 that matter). Set-string encoding has reasonable, and affordable,
 cost. It can be all put together and presented in a rather less
 pessimistic manner. (For example, there is a cache_decode_set routine
 that boosts things by a factor of about o 5. This is simply due to the
 fact that you do not always have to decode the same Provides version
 all over again. This is part of that fancy-schmancy stuff which you
 must ignore on the first reading.)
 
 So I suggest that set-strings must cover all usages of set-subset
 computations, where static data structures are appropriate (so that
 you compute it, write into RPM header, and do not expect to modify
 it). it is still efficient, despite the fact that we try hard to
 achieve a better string representation. The fact that it can be
 packaged nicely does not make it any worse in other respects, except
 that obviously we can stuff somewhat less bits into alnum characters.
 But I also wanted it to look nice (this is why base62 was devised
 over much simpler base64 encoding). But so, it looks nice, and it is
 efficient. Why are you unhappy? Do you necessarily require that you
 have to do bitmask tests yourself?

I'm not unhappy. Rather I'm pointing out that -- if your primary
goal is bandwidth reduction in *.rpm package metadata -- that there are
equivalent savings that aren't from pretty encodings.

The other reason is this: missing values in the {E,V,R} triple
(and attaching some reasonable DWIM semantic) lead to huge
amounts of pointless discussion. Adding all of
twiddle-in-version
set:versions
dpkg-like
is all underway atm. Running digest/encoding strings through
legacy compatible (i.e. what most rpm tool chains will
continue to be using for years) rpmvercmp(3) is an accident
waiting to be reported again again again.

Its just as easy to attach a tag with binary data as any other
approach to pretty encodings, all of which involve some
aesthetic choice of permitted/excluded ASCII characters, and so
the encodings proliferate to meet all possible opinions.

Perhaps making base61 encoding MANDATORY in rpm would displease
everyone equally: choosing a prime is as pretty as all other
encoding criteria. 

73 de Jeff

__
RPM Package Managerhttp://rpm5.org
Developer Communication Listrpm-devel@rpm5.org


Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version

2012-06-23 Thread Jeffrey Johnson

On Jun 23, 2012, at 2:10 PM, Jeffrey Johnson wrote:

 
 On Jun 23, 2012, at 1:49 PM, Alexey Tourbin alexey.tour...@gmail.com wrote:
 

…

 Perhaps making base61 encoding MANDATORY in rpm would displease
 everyone equally: choosing a prime is as pretty as all other
 encoding criteria. 
 

In the interest of getting off negative nerdy obscure discussions, let's
try a positive alternative application for Golob-Rice subset operations.

All RPMv4 packages attach (a lightly filtered) file(1) magic string to
every file.

The file(1) data is mostly usable as a keyword namespace exactly as is.
Yes there are flaws: however magic strings are from file(1) is about
as good as any other de facto keyword tagging of file content.

keywords are strings just like elf symbols are, and set:versions (or Bloom 
filters)
are a compact representation from which its rather easy to do subset 
computations.

One extension that would be needed is a closest metric in order to prefer
the largest subset overlap: with set:versions any contained subset will satisfy 
the
logical assertions, and there's no easy way to prefer the larger sub-set.

There's a similar application with dual/triple/... licensed software and 
computing
per-file, not per-package, license affinity precisely where set:versions (or 
Bloom
filters) will represent keywords (like LGPLv2 or BSD) easily. Licenses 
unlike
file(1) magic keywords will require name space administration. SUrely LSB and 
LFF
are seeking something useful to do for RPM packaging these days, and might
be convinced to make some set of license tokens standard so that license
affinity can be precisely computed in distributed software.

73 de Jeff



Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version

2012-06-22 Thread Alexey Tourbin
On Thu, Jun 21, 2012 at 7:51 PM, Jeffrey Johnson n3...@me.com wrote:
 On Jun 18, 2012, at 2:32 PM, Jeffrey Johnson wrote:

 The contained in or subset semantic that applies to the operations  and 
 =
 is rather easy to do as well. E.g. if (assuming on;y existence, not 
 versioned inequality ranges)
       P == Bloom filter of Provides: tokens
       R == Bloom filter of a (possible) subset of tokens
 then the subset computation is nothing more than
       (P  R) == R

That's right. Set-versions do the same thing, only reframed in terms
of set of numbers (as opposed to explicit bit masks). The rpmsetcmp
routine basically does just that: (P  R) == R check, written in a
slightly different manner, with a bunch of fancy stuff. :-)

 I should point out the implicit (and tricky) assumption regarding
 using Bloom filters to easily compute set and union intersection:
        Both P and R MUST have exactly the same parameterization.
 Choosing an a priori parametrization is hard because it MUST
 deal with the worst possible case of the largest estimated population
 which increases the sparseness of all other Bloom filters.

There are a few kinds of parametrization which we must ponder.
First, you should use the same hash function for R and P sets, so that
it makes the same number per symbol (or sets the same bits). With
different hash functions, there is no chance to compare sets
meaningfully. Second, you should also ponder what a number actually
is (or how high a bit you may want to be able to set). The thing is,
the numbers are not unlimited (or otherwise the whole world can be
represented with just a big number). It helps to think of a number as
a tiny bit of information within a limited range. The range is another
parameter.

 This also applies to tuning to reduce the probability of false positives.

 There are no obvious ways to rescale a Bloom filter meaningfully either.

There IS a (not-so-)obvious way to rescale Bloom filters, and numbers,
for that matter. Big surprise, big surprise! It boils down to the
question what a number is. If your number is just a few bits in the
range modulo power of 2, you can rescale the number down to a
smaller range by simply stripping its higher bits. The equivalent
operation can be devised to downsample a bloom filter into a lesser
precision, provided that some (not-so-)obvious conditions are met.
Basically you need to split the filter into two halves and bitwise-OR
them.

 Do set: versions have the same difficulty (even if the parameters are hidden
 as implementation/design constraints somehow)?

There is some difficulty that we always need to use the same hash
function, which is hardwired, and cannot be easily changed (it is
Jenkins on-at-a-time hash with a fancy initial constant). There is no
difficulty to compare set-versions per se, though. Because a
set-versions is just a set of numbers within a limited range modulo
power of two. If ranges are different, you first need to downsample
the higher-ranged set by stripping its higher bits and sorting the
numbers again, but then you proceed normally. Note that the range, or
bpp, has to be encoded explicitly, and is part of a set-version.
Again, it helps to think in terms what a number is. A number is a
thing within the range, modulo power of two. If you don't know the
range, you don't know what the number is.

 Or does the Golomb-Rice encoding scale more naturally than Bloom filter 
 parameters?
 If the set:versions implementation scales better than Bloom filters (which 
 seems to be the case),
 then there are lots of usage cases (like packages in a repository, or files 
 in a package) where
 an efficient means to compute set membership is quite useful.

 (aside)
 Apologies for scales imprecision. I'm merely trying to ask is
        To what extent does set:versions depend on a priori assumptions?

 BTW, what is the current false positive failure probability for set:versions?

What is a probability? (That's the second stunning questions after
what is a number.) If numbers were unlimited, they could have
represented symbols exactly. The only reason the numbers clash is
because they sit within a limited range. What you have to do to
control probabilities is to select the appropriate range, which is a
trade-off between how many clashes are possible and how much bits per
number you are allowed to take. The basic idea is that, to encode a
Provides version, you first need to know how many symbols you are
going to encode, say 1024. You then ask yourself, how many bits per
symbol should I take. If you take e.g. only 10 bits, that's plain
stupid, because 10-bit numbers can address only 1024 values (so you
end up with most of the bits set, or most of the numbers taken, that
is to say). If you take 20 bits per symbol, though, things get more
interesting. 20-bit numbers can address 2^{20} different symbols.
Since the probability is due to range limitations, you get
2^{10}/2^{20}=0.1% false positive rate, which is basically what you
want. 

Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version

2012-06-22 Thread Michael Schroeder
On Fri, Jun 22, 2012 at 09:39:22PM +0400, Alexey Tourbin wrote:
 What is a probability? (That's the second stunning questions after
 what is a number.) If numbers were unlimited, they could have
 represented symbols exactly. The only reason the numbers clash is
 because they sit within a limited range. What you have to do to
 control probabilities is to select the appropriate range, which is a
 trade-off between how many clashes are possible and how much bits per
 number you are allowed to take. The basic idea is that, to encode a
 Provides version, you first need to know how many symbols you are
 going to encode, say 1024. You then ask yourself, how many bits per
 symbol should I take. If you take e.g. only 10 bits, that's plain
 stupid, because 10-bit numbers can address only 1024 values (so you
 end up with most of the bits set, or most of the numbers taken, that
 is to say). If you take 20 bits per symbol, though, things get more
 interesting. 20-bit numbers can address 2^{20} different symbols.
 Since the probability is due to range limitations, you get
 2^{10}/2^{20}=0.1% false positive rate, which is basically what you
 want. When encoding a Requires version, it's another matter: you
 should use the same number of bits as the corresponding Requires
 versions have.

You mean the corresponding Provides, i.e. the symbols in the target
library, right? Basically you should never need to downsample when
doing the set comparison, as downsampling kills your probablilities.

Cheers,
  Michael.

-- 
Michael Schroeder   m...@suse.de
SUSE LINUX Products GmbH,  GF Jeff Hawn, HRB 16746 AG Nuernberg
main(_){while(_=~getchar())putchar(~_-1/(~(_|32)/13*2-11)*13);}
__
RPM Package Managerhttp://rpm5.org
Developer Communication Listrpm-devel@rpm5.org


Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version

2012-06-22 Thread Alexey Tourbin
On Fri, Jun 22, 2012 at 5:30 AM, Jeffrey Johnson n3...@me.com wrote:
 Sure numbers make sense.

 But God invented 0 and 1 and who needs steenkin carries to do
 arithmetic in Galois fields?!?

Jeffery, I understand that you are ironic and sarcastic, but I can't
see the reason why, as per our discussion. If you ask honest
questions, like what this means, I try to do my best to answer the
question, possibly involving considerations like numbers comes from
God, which are questionable but not irrelevant. They may help to
understand, or may not.

 P: 01010010010101
 R: 010001

 Yes … but this is premature optimization …

There is no premature optimization here, and it is fair to ask what we
may possibly want to try to optimize. My answer is: first, it must
take less space (given the probability) because we have to download
the repo metadata every time we run the update; second, set-versions
must compare quickly (we must be able to compare them all within a
second). What's your suggestion? Do you want them to take more space
and compare slowly, so as to avoid premature optimizations? :-) Or do
you dislike them exactly because they are not Bloom filters? Go on
then, make shorter strings, given the same probability, which compare
faster! But this is largely impossible, exactly because of these
laughable topics like what is a number or what is a probability
which I'm trying to present.

 How big the repo is determines how important the
 distro is and nothing else. Just look at how impo'tent Debian is …

Your considerations are probably true but are largely irrelevant to
the discussion.
Besides that, I can't understand them. Are you sarcastic? I am
sarcastic too, that's not at all a problem. Shall we talk? :-)

 $ print -l sym1 sym2 |/usr/lib/rpm/mkset 10
 set:dbUnz4
 $ print -l sym1 sym2 |/usr/lib/rpm/mkset 20
 set:nl2ZEALdS

 In the first run, the probability is modest (print -l is zsh syntax
 to print an arg per line). In the second run, the probability is much
 higher (and it takes more space). There is also a function in scripts
 to sustain probabilities at about o 0.1%.

 Hmmm … I've been defaulting Bloom filters ultra conservatively
 at 10**-4 and mobing to 10**-6 at the first hint of a possible problem.

The number does not matter here, the only consideration was that
2^{-10} was kind of cute number, and it works well in practice - that
is, we catch all or most of the bugs where a library symbol has been
deleted. But it could also be 2^{-16} or even 2^{-20}. It is a
trade-off, and I see no ground for criticizing it for just that. Or
again it boils down to the question how many megabytes you expect to
download when you run the update.
__
RPM Package Managerhttp://rpm5.org
Developer Communication Listrpm-devel@rpm5.org


Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version

2012-06-22 Thread Jeffrey Johnson

On Jun 22, 2012, at 5:26 PM, Alexey Tourbin alexey.tour...@gmail.com wrote:

 On Fri, Jun 22, 2012 at 5:30 AM, Jeffrey Johnson n3...@me.com wrote:
 Sure numbers make sense.
 
 But God invented 0 and 1 and who needs steenkin carries to do
 arithmetic in Galois fields?!?
 
 Jeffery, I understand that you are ironic and sarcastic, but I can't
 see the reason why, as per our discussion. If you ask honest
 questions, like what this means, I try to do my best to answer the
 question, possibly involving considerations like numbers comes from
 God, which are questionable but not irrelevant. They may help to
 understand, or may not.
 

Apologies for the sarcasm. My comments were inline with your observations
regarding undergraduate difficulties with Galois fields and matrices imho.

But math jokes are bone dry …


 P: 01010010010101
 R: 010001
 
 Yes … but this is premature optimization …
 
 There is no premature optimization here, and it is fair to ask what we
 may possibly want to try to optimize. My answer is: first, it must
 take less space (given the probability) because we have to download
 the repo metadata every time we run the update; second, set-versions
 must compare quickly (we must be able to compare them all within a
 second). What's your suggestion? Do you want them to take more space
 and compare slowly, so as to avoid premature optimizations? :-) Or do
 you dislike them exactly because they are not Bloom filters? Go on
 then, make shorter strings, given the same probability, which compare
 faster! But this is largely impossible, exactly because of these
 laughable topics like what is a number or what is a probability
 which I'm trying to present.
 

I would state that compression (of any sort) to minimize
bandwidth is entirely the wrong problem to solve.

Even in Kazakistan, one can minimize bandwidth by
Upgrading less often!


My contrarian POV should not be taken as opposed to compression
or elegance or anything else. Just that minimal size in the
representation of bit fields (or bit positions as numbers)
overly limits the applicability of set:versions.

There are lots of usage cases for efficient sub-set computations
in package management, not just as a de facto API/ABI check
using ELF symbols. Most of the other usage cases for efficient
sub-set computations are not subject to an ultimately optimal
string encoding.

E.g. ripping off the base62 encoding and distributing binary
assertion fields saves at least as much bandwidth as your
guesstimate that a Golomb-Rice code is ~20% more efficient
than a Bloom filter.

So don't take my flip comments as being opposed to what you have
implemented.

 How big the repo is determines how important the
 distro is and nothing else. Just look at how impo'tent Debian is …
 
 Your considerations are probably true but are largely irrelevant to
 the discussion.
 Besides that, I can't understand them. Are you sarcastic? I am
 sarcastic too, that's not at all a problem. Shall we talk? :-)
 

Just in case: yes acidic sarcasm was fully intended.

 $ print -l sym1 sym2 |/usr/lib/rpm/mkset 10
 set:dbUnz4
 $ print -l sym1 sym2 |/usr/lib/rpm/mkset 20
 set:nl2ZEALdS
 
 In the first run, the probability is modest (print -l is zsh syntax
 to print an arg per line). In the second run, the probability is much
 higher (and it takes more space). There is also a function in scripts
 to sustain probabilities at about o 0.1%.
 
 Hmmm … I've been defaulting Bloom filters ultra conservatively
 at 10**-4 and mobing to 10**-6 at the first hint of a possible problem.
 
 The number does not matter here, the only consideration was that
 2^{-10} was kind of cute number, and it works well in practice - that
 is, we catch all or most of the bugs where a library symbol has been
 deleted. But it could also be 2^{-16} or even 2^{-20}. It is a
 trade-off, and I see no ground for criticizing it for just that. Or
 again it boils down to the question how many megabytes you expect to
 download when you run the update.

You already made a valid point asking what probability means wrto
false positives. In fact your 2**10/2**20 is a very different
estimate of false positive probability than the approx. currently
in use in rpmbf.c (which is based on a statistical model for
Bloom filter parameters).

Below is what is in use by RPM which chooses {m, k} given
an estimated population n and a desired probability of false
positives e (I forget where I found the actual computation,
can likely find it again if/when necessary).

What I would like to be able to do with set:versions
in RPM is to be able to use either of these 2 container
representations interchangeably.

I'd also like to be able to pre-compute either form
in package tags for whatever purpose without the
dreadful tyranny of being legacy compatible with
older versions of RPM. Adding a new tag that isn't
used by older RPM is exactly as compatible as following
existing string representations for NEVR dependencies
in existing tags, 

Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version

2012-06-22 Thread Alexey Tourbin
On Sat, Jun 23, 2012 at 1:55 AM, Jeffrey Johnson n3...@me.com wrote:
 I would state that compression (of any sort) to minimize
 bandwidth is entirely the wrong problem to solve.

So what kind of a problem are we trying to solve? Why are you making
all these small puns? Are you ready to go out and pronounce that I am
the Jefferey Johnson is entirely confident that whichever problem we
face we must address, but to minimize bandwidth is entirely wrong and
shall be addressed never! :-)

Why is it any wrong to minimize bandwidth, or, in other words, why it
is bad to spend less money? Your answer is like, because the meaning
of life is not to spend less money, which is a wrong perspective.
Okay, but what's a better perspective? Spending more money for less
good is not at all a better perspective.

Okay, but what do we actually try to do? Um, we try to bring some
binary compatibility by using the limited amount of information with
some contrived data structures which test set-subset relation. If
space were not a problem, we could simply plug ELF sections into RPM
headers, couldn't we? Why not? And why is there a distinction between
the header and the payload? :-)

 My contrarian POV should not be taken as opposed to compression
 or elegance or anything else. Just that minimal size in the
 representation of bit fields (or bit positions as numbers)
 overly limits the applicability of set:versions.

 There are lots of usage cases for efficient sub-set computations
 in package management, not just as a de facto API/ABI check
 using ELF symbols. Most of the other usage cases for efficient
 sub-set computations are not subject to an ultimately optimal
 string encoding.

My POV is that I ask how the best I can do what I need to do, is it
doable, what is the price, can it be reduced, etc. Of course, these
best to do and need to do terms are not exactly mathematical, and
I already hear some laughs in the audience. Nevertheless, the only
thing which you can oppose to this is simply a better implementation.
Anything else does not count. Why? That's because! Go tell people they
should spend more money. ;-)

 E.g. ripping off the base62 encoding and distributing binary
 assertion fields saves at least as much bandwidth as your
 guesstimate that a Golomb-Rice code is ~20% more efficient
 than a Bloom filter.

I see no point in criticizing base62 encoding being suboptimal as
compared to raw bytes, because it does just that: squeezes bits into
alnum characters. It is pretty clear and pointless that raw bytes
stuff more bits. The goal was just that - to dump bits into
readable and usable form.

 Just in case: yes acidic sarcasm was fully intended.

There is no point in your sarcasm, except that probably that you are
very clever (which I totally agree).

 You already made a valid point asking what probability means wrto
 false positives. In fact your 2**10/2**20 is a very different
 estimate of false positive probability than the approx. currently
 in use in rpmbf.c (which is based on a statistical model for
 Bloom filter parameters).

The 2**10/2**20 is the best estimate and you probably cannot further
improve it. The whole business of set-versions, which I have been
pondering recently, boils down to the questions, can you improve the
ratio just a little bit? Or can you pay just a little bit less for a
little bit more? Is there a better data structure? The answer is
basically turns out to be NO. The set of numbers is good to go,
and the Galois fields have some discrepancies which you do not want to
know, and the benefit is very marginal.

 Below is what is in use by RPM which chooses {m, k} given
 an estimated population n and a desired probability of false
 positives e (I forget where I found the actual computation,
 can likely find it again if/when necessary).

 What I would like to be able to do with set:versions
 in RPM is to be able to use either of these 2 container
 representations interchangeably.

 I'd also like to be able to pre-compute either form
 in package tags for whatever purpose without the
 dreadful tyranny of being legacy compatible with
 older versions of RPM. Adding a new tag that isn't
 used by older RPM is exactly as compatible as following
 existing string representations for NEVR dependencies
 in existing tags, arguably more compatible because
 older versions of RPM are incapable of interpreting
 unknown tags.

 And again: the best solution for the download bandwidth
 problem is
        Update the metadata less often.
 if you think a bit.

The best solution for don't be stupid is just don't be stupid. Of
course, downloading metadata is not the only problem, and not exactly
the one which we exactly try to solve. It all combines into a complex
system. The question is then, to me, can we do any better? How much
does this whole business cost? if it's still expensive, which it is,
how can we cut down on the costs? These kind of questions I am willing
to discuss with the great and unspeakable joy in my heart, 

Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version

2012-06-22 Thread Jeffrey Johnson

On Jun 22, 2012, at 8:02 PM, Alexey Tourbin alexey.tour...@gmail.com wrote:

 On Sat, Jun 23, 2012 at 1:55 AM, Jeffrey Johnson n3...@me.com wrote:
 I would state that compression (of any sort) to minimize
 bandwidth is entirely the wrong problem to solve.
 
 So what kind of a problem are we trying to solve? Why are you making
 all these small puns? Are you ready to go out and pronounce that I am
 the Jefferey Johnson is entirely confident that whichever problem we
 face we must address, but to minimize bandwidth is entirely wrong and
 shall be addressed never! :-)
 

For the record: there is no intentional pun in that quote: I
tried to state a misdirection that cannot (and SHOULD not) be
solved in the context of package management, nothing more.

And I do feel for PTT's in Kzhakistan extracting every ruble
for bandwidth.

 Why is it any wrong to minimize bandwidth, or, in other words, why it
 is bad to spend less money? Your answer is like, because the meaning
 of life is not to spend less money, which is a wrong perspective.
 Okay, but what's a better perspective? Spending more money for less
 good is not at all a better perspective.
 

Nothing I said implies that bandwidth reduction is wrong.

If you want to minimize bandwidth downloading packages compress
the metadata and deal with the legacy compatible issues however
you want/need.

My rule of thumb is that metadata is ~15% of the size of a
typical *.rpm package: assuming 50% compression one might save
7.5% of the package download cost (0.50 * 0.15 = 7.5%).

The real problem is the perceived need to be up-to-date
as of this second, and the duplicated/redundant/necessary
monolithic repository metadata download is an expensive way to
determine
Nothing to upgrade: relax.
An incremental download of changed metadata is simple
engineering (not widely practiced).

 Okay, but what do we actually try to do? Um, we try to bring some
 binary compatibility by using the limited amount of information with
 some contrived data structures which test set-subset relation. If
 space were not a problem, we could simply plug ELF sections into RPM
 headers, couldn't we? Why not? And why is there a distinction between
 the header and the payload? :-)
 

Hmm .. we have different POV here: I have the luxury of being
able to change anything in RPM that makes sound engineering sense,
which perhaps warps my opinions.

Does it make engineering sense to include ELF headers in their
entirety in *.rpm metadata?
No.

Does it make sense to attempt set:versions for de facto ABI subsets
(if done carefully)?
Yes.

I *like* set:versions or we would not be having this conversation.

 My contrarian POV should not be taken as opposed to compression
 or elegance or anything else. Just that minimal size in the
 representation of bit fields (or bit positions as numbers)
 overly limits the applicability of set:versions.
 
 There are lots of usage cases for efficient sub-set computations
 in package management, not just as a de facto API/ABI check
 using ELF symbols. Most of the other usage cases for efficient
 sub-set computations are not subject to an ultimately optimal
 string encoding.
 
 My POV is that I ask how the best I can do what I need to do, is it
 doable, what is the price, can it be reduced, etc. Of course, these
 best to do and need to do terms are not exactly mathematical, and
 I already hear some laughs in the audience. Nevertheless, the only
 thing which you can oppose to this is simply a better implementation.
 Anything else does not count. Why? That's because! Go tell people they
 should spend more money. ;-)
 

again: I *like* set:versions and see many usage cases, not just in dependencies.

 E.g. ripping off the base62 encoding and distributing binary
 assertion fields saves at least as much bandwidth as your
 guesstimate that a Golomb-Rice code is ~20% more efficient
 than a Bloom filter.
 
 I see no point in criticizing base62 encoding being suboptimal as
 compared to raw bytes, because it does just that: squeezes bits into
 alnum characters. It is pretty clear and pointless that raw bytes
 stuff more bits. The goal was just that - to dump bits into
 readable and usable form.
 

I did not say sub-optimal: I said that the encoding into a string
entails a cost that is largely the same as the difference between
numbers in set:versions and bits in Bloom filters.

I can tie binary set:versions into traditional EVR tags if I choose
to attempt entirely and entirely remove the need for base62 encoding
if necessary. So could you.

There is no desire to break legacy compatibility needlessly however;
base63 is as good as its gonna get if one MUST encode into a string.

 Just in case: yes acidic sarcasm was fully intended.
 
 There is no point in your sarcasm, except that probably that you are
 very clever (which I totally agree).
 

You miss the point of the sarcasm: I don't give a h00t whether
I'm clever: I do believe that Debian focus on compatibility
is 

Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version

2012-06-21 Thread Jeffrey Johnson

On Jun 21, 2012, at 12:51 AM, Alexey Tourbin wrote:

 On Thu, Jun 21, 2012 at 12:15 AM, Alexey Tourbin
 alexey.tour...@gmail.com wrote:
 On Mon, Jun 18, 2012 at 10:32 PM, Jeffrey Johnson n3...@me.com wrote:
 Good: the above confirmation of the characteristics allows a set:versions
 implementation to proceed.
 
 Hello, there's been some speculation about Bloom filters below, which
 I cannot address right now, offhand. Nevertheless, I can say that, in
 some highly mathematical sense, set-versions are exactly equivalent to
 Bloom filters. They do just the same thing, if you will. The only
 difference is that set-versions are more compact: they take somewhat
 less space, which was, if you remember, the number one goal of the
 original implementation.
 
 More precisely, a set-version can be (in principle) converted to a
 Bloom filter which uses only one hash function. The idea is that such
 a filter will set bits in a highly sparse set of bits, one by one.
 Instead, a set-version remembers the bits simply by their indices.
 Setting the bit becomes taking the number, and there is a
 straightforward correspondence. It also turns out that a set of
 numbers can be easier to deal with, as opposed to a sparse set of
 bits.
 

You've made an unsupported claim here re numbers can be easier.
Presumably you are talking about means of encoding, where clever
mathematics derived from numerical transforms can be used to
remove redundancy. With a pile of bits in a Bloom filter all one
has to work with is a pile of bits.

The counter argument is this: testing bits (in a Bloom filter) is
rather KISS. Other compression/redundancy removal schemes
end up trading storage for implementation complexity cost.
In a running installer like RPM, there will always be a need
for memory dealing with payload contents. Since large amounts
of memory are eventually going to be needed/used, savings
by using Golomb-Reid codes are mostly irrelevant for the
100K - 1M set sizes used by RPM no matter whether bits or
numbers are used to represent.

But don't take my argument as being opposed to set:versions whatsoever.
Just (for RPM and even for *.rpm) that compression size isn't as important
to me as it is to you as the primary goal of set: versions.

 If you want to know more why things have to work like this, and e.g.
 where constants pop up, there is a good starting point at Cache-,
 Hash-, and Space-efficient Bloom filters paper by Felix Putze, Peter
 Sanders, and Johannes Singler. Actually this paper helped me a lot to
 put things together and to produce the original implementation.
 Reading this paper requires some working mathematical knowledge,
 though. This requirement must not be underestimated, but also should
 not be overestimated. The paper is very readable: it tells you what
 you may want to do and what you have to do.
 http://algo2.iti.kit.edu/singler/publications/cacheefficientbloomfilters-wea2007.pdf

Thanks for the pointer. I actually read and studied this paper a few years back
trying to decide whether there was a need to attempt cache-line optimization
(but I totally ignored the paragraph that links Golomb codes to Bloom filters).

At the time I decided that the complexity cost to attempt cache-line 
optimization
wasn't a priority in RPM, but rather KISS simplicity while users considered
changes that involve a finite probability of false positive failures.

AFAICT set:versions also has a finite probability of failure due to (relatively 
rare)
hash collisions converting strings to numbers.

Is there any hope that set:versions (or the Golomb-Rice code) can supply a 
parameterization
to tune the probability of false positives?

Or am I mis-understanding what is implemented?

73 de Jeff
 __
 RPM Package Managerhttp://rpm5.org
 Developer Communication Listrpm-devel@rpm5.org

__
RPM Package Managerhttp://rpm5.org
Developer Communication Listrpm-devel@rpm5.org


Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version

2012-06-21 Thread Jeffrey Johnson

On Jun 18, 2012, at 2:32 PM, Jeffrey Johnson wrote:

 
 The contained in or subset semantic that applies to the operations  and 
 =
 is rather easy to do as well. E.g. if (assuming on;y existence, not versioned 
 inequality ranges)
   P == Bloom filter of Provides: tokens
   R == Bloom filter of a (possible) subset of tokens
 then the subset computation is nothing more than
   (P  R) == R
 

I should point out the implicit (and tricky) assumption regarding
using Bloom filters to easily compute set and union intersection:
Both P and R MUST have exactly the same parameterization.
Choosing an a priori parametrization is hard because it MUST
deal with the worst possible case of the largest estimated population
which increases the sparseness of all other Bloom filters.

This also applies to tuning to reduce the probability of false positives.

There are no obvious ways to rescale a Bloom filter meaningfully either.

Do set: versions have the same difficulty (even if the parameters are hidden
as implementation/design constraints somehow)?

Or does the Golomb-Rice encoding scale more naturally than Bloom filter 
parameters?
If the set:versions implementation scales better than Bloom filters (which 
seems to be the case),
then there are lots of usage cases (like packages in a repository, or files in 
a package) where
an efficient means to compute set membership is quite useful.

(aside)
Apologies for scales imprecision. I'm merely trying to ask is
To what extent does set:versions depend on a priori assumptions?

BTW, what is the current false positive failure probability for set:versions?


73 de Jeff
__
RPM Package Managerhttp://rpm5.org
Developer Communication Listrpm-devel@rpm5.org


Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version

2012-06-21 Thread Alexey Tourbin
On Thu, Jun 21, 2012 at 7:28 PM, Jeffrey Johnson n3...@me.com wrote:
 More precisely, a set-version can be (in principle) converted to a
 Bloom filter which uses only one hash function. The idea is that such
 a filter will set bits in a highly sparse set of bits, one by one.
 Instead, a set-version remembers the bits simply by their indices.
 Setting the bit becomes taking the number, and there is a
 straightforward correspondence. It also turns out that a set of
 numbers can be easier to deal with, as opposed to a sparse set of
 bits.

 You've made an unsupported claim here re numbers can be easier.
 Presumably you are talking about means of encoding, where clever
 mathematics derived from numerical transforms can be used to
 remove redundancy. With a pile of bits in a Bloom filter all one
 has to work with is a pile of bits.

Numbers are easier to deal with because you can use 'unsigned v[]'
array to represent them, which can be seen throughout the code,
accompanied by 'int c' (which is argc/argv[] style). Bits are somewhat
more complicated: you need to use something like sys/select.h
macros, and you cannot use pointer arithmetic to implement traversal,
etc.

Also, to me, a set of numbers just makes sense, and is good to go.
If you wonder how much complicated things can get, there's been some
recent papers out there which use matrix solving for approximate set
representation. They are basically unreadable (given the undergraduate
skills), and you can be completely lost because there is apparently no
connection between the Galois fields and what you actually try to do.
After you try to read these papers, the set of numbers becomes a
wonderful salvation which comes directly from God and the Holy Spirit
and brings peace to your soul. This wonderful salvation, which comes
from God, turns out to be a good thing. You should use it, when you
have some. :-)

 The counter argument is this: testing bits (in a Bloom filter) is
 rather KISS. Other compression/redundancy removal schemes
 end up trading storage for implementation complexity cost.
 In a running installer like RPM, there will always be a need
 for memory dealing with payload contents. Since large amounts
 of memory are eventually going to be needed/used, savings
 by using Golomb-Reid codes are mostly irrelevant for the
 100K - 1M set sizes used by RPM no matter whether bits or
 numbers are used to represent.

I do not agree that testing bits is more KISS than a set of numbers.
It you think that testing bits is the only and natural thing to do, I
disagree. The reason is: Requires versions are much more sparse than
Provides versions. If the Provides version is optimal, in terms of bit
usage (50% set), the Requires version must be necessarily suboptimal
(too few bits set). It can be pictured like this:

P: 01010010010101
R: 010001

The conclusion is, if you want to stick to bitmasks and reduce subset
comparison to bitmask tests, there will be a great deal of
inefficiency because of sparse R-sets. On the other hand, sets of
numbers can be pictured like this:

P: 1,3,7,13,15
R: 1,13

Note that R-set does not take extra space, and there is a simple
merge-like algorithm which advances either R or P (or both - which is
how rpmsetcmp routine is implemented).

 But don't take my argument as being opposed to set:versions whatsoever.
 Just (for RPM and even for *.rpm) that compression size isn't as important
 to me as it is to you as the primary goal of set: versions.

The size of *.rpm is very important when you want to update the
information about a repo, like in apt-get update - you have to
download it all. The question is then how big the repo is, and how
many set-versions you can afford to download. :-) The primary goal
was not to make things much worse. In other words, the price must not
be prohibitive even for the repo of 10k or 15k packages strong.
(Please also realize that the Internet was not a given until very
recent times - some shoddy ISPs in Kamchatka still want to charge
something like $0.1 per 1Mb.)

 AFAICT set:versions also has a finite probability of failure due to 
 (relatively rare)
 hash collisions converting strings to numbers.

 Is there any hope that set:versions (or the Golomb-Rice code) can supply a 
 parameterization
 to tune the probability of false positives?

 Or am I mis-understanding what is implemented?

You can tune the probability by selecting the appropriate bpp
paramter, and passing it to mkset (which is how scripts work). The
bpp indicates how much bits must be used per hash value (the range
is 10..32). For example, if you want to encode Provides versions which
has 1024 symbols, and the error rate has to be fixed at about 0.1%,
you need to use bpp=20 - that is, after each symbol is hashed, its 20
lower bits will be taken into account and encoded. The probability
then is simply the number of symbols over the capacity of the
universe, which is 2^{10}/2^{20}=0.1%. On the other hand, when you
encode Requires versions, you 

Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version

2012-06-21 Thread Jeffrey Johnson

On Jun 21, 2012, at 8:27 PM, Alexey Tourbin alexey.tour...@gmail.com wrote:

 On Thu, Jun 21, 2012 at 7:28 PM, Jeffrey Johnson n3...@me.com wrote:
 More precisely, a set-version can be (in principle) converted to a
 Bloom filter which uses only one hash function. The idea is that such
 a filter will set bits in a highly sparse set of bits, one by one.
 Instead, a set-version remembers the bits simply by their indices.
 Setting the bit becomes taking the number, and there is a
 straightforward correspondence. It also turns out that a set of
 numbers can be easier to deal with, as opposed to a sparse set of
 bits.
 
 You've made an unsupported claim here re numbers can be easier.
 Presumably you are talking about means of encoding, where clever
 mathematics derived from numerical transforms can be used to
 remove redundancy. With a pile of bits in a Bloom filter all one
 has to work with is a pile of bits.
 
 Numbers are easier to deal with because you can use 'unsigned v[]'
 array to represent them, which can be seen throughout the code,
 accompanied by 'int c' (which is argc/argv[] style). Bits are somewhat
 more complicated: you need to use something like sys/select.h
 macros, and you cannot use pointer arithmetic to implement traversal,
 etc.
 

Well different strokes here. The
#include sys/select.h
is complicated only because of effete Linux kernel debutantes
with their precious kernel stack advocacy issues re select(2).
These minor defines (swiped directly from sys/select.h mind you)

are hugely portable:
#define __PBM_NBITS /*@-sizeoftype@*/(8 * sizeof(__pbm_bits))/*@=sizeoftype@*/
#define __PBM_IX(d) ((d) / __PBM_NBITS)
#define __PBM_MASK(d)   ((__pbm_bits) 1  (((unsigned)(d)) % __PBM_NBITS))
#define __PBM_BITS(set) ((__pbm_bits *)(set)-bits)

#define PBM_FREE(s) _free(s);
#define PBM_SET(d, s)   (__PBM_BITS (s)[__PBM_IX (d)] |= __PBM_MASK (d))
#define PBM_CLR(d, s)   (__PBM_BITS (s)[__PBM_IX (d)] = ~__PBM_MASK (d))
#define PBM_ISSET(d, s) ((__PBM_BITS (s)[__PBM_IX (d)]  __PBM_MASK (d)) != 0)

#define PBM_ALLOC(d)xcalloc(__PBM_IX (d) + 1, __PBM_NBITS/8)

See rpmio/rpmbf.h.

 Also, to me, a set of numbers just makes sense, and is good to go.
 If you wonder how much complicated things can get, there's been some
 recent papers out there which use matrix solving for approximate set
 representation. They are basically unreadable (given the undergraduate
 skills), and you can be completely lost because there is apparently no
 connection between the Galois fields and what you actually try to do.
 After you try to read these papers, the set of numbers becomes a
 wonderful salvation which comes directly from God and the Holy Spirit
 and brings peace to your soul. This wonderful salvation, which comes
 from God, turns out to be a good thing. You should use it, when you
 have some. :-)
 

Sure numbers make sense.

But God invented 0 and 1 and who needs steenkin carries to do
arithmetic in Galois fields?!?

 The counter argument is this: testing bits (in a Bloom filter) is
 rather KISS. Other compression/redundancy removal schemes
 end up trading storage for implementation complexity cost.
 In a running installer like RPM, there will always be a need
 for memory dealing with payload contents. Since large amounts
 of memory are eventually going to be needed/used, savings
 by using Golomb-Reid codes are mostly irrelevant for the
 100K - 1M set sizes used by RPM no matter whether bits or
 numbers are used to represent.
 
 I do not agree that testing bits is more KISS than a set of numbers.
 It you think that testing bits is the only and natural thing to do, I
 disagree. The reason is: Requires versions are much more sparse than
 Provides versions. If the Provides version is optimal, in terms of bit
 usage (50% set), the Requires version must be necessarily suboptimal
 (too few bits set). It can be pictured like this:
 
 P: 01010010010101
 R: 010001
 

Yes … but this is premature optimization …

 The conclusion is, if you want to stick to bitmasks and reduce subset
 comparison to bitmask tests, there will be a great deal of
 inefficiency because of sparse R-sets. On the other hand, sets of
 numbers can be pictured like this:
 
 P: 1,3,7,13,15
 R: 1,13
 
 Note that R-set does not take extra space, and there is a simple
 merge-like algorithm which advances either R or P (or both - which is
 how rpmsetcmp routine is implemented).
 

… if you are serious abtout minimal size, compressing headers
in *.rpm packages will save more bandwidth than any single other
optimization one might choose to implement.

 But don't take my argument as being opposed to set:versions whatsoever.
 Just (for RPM and even for *.rpm) that compression size isn't as important
 to me as it is to you as the primary goal of set: versions.
 
 The size of *.rpm is very important when you want to update the
 information about a repo, like in apt-get update - you have to
 download it all. The question is then how big the 

Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version

2012-06-20 Thread Alexey Tourbin
On Mon, Jun 18, 2012 at 10:32 PM, Jeffrey Johnson n3...@me.com wrote:
 Good: the above confirmation of the characteristics allows a set:versions
 implementation to proceed.

Hello, there's been some speculation about Bloom filters below, which
I cannot address right now, offhand. Nevertheless, I can say that, in
some highly mathematical sense, set-versions are exactly equivalent to
Bloom filters. They do just the same thing, if you will. The only
difference is that set-versions are more compact: they take somewhat
less space, which was, if you remember, the number one goal of the
original implementation.

Very informal, if you want to encode 1K symbols out of 1M symbols
using a bloom filter, you need at least
1024 * 1.44 * 10 bits of information (which is a factor of 1.44 per symbol)
while with a set-version you need at least
1024 * (10 + 1.44) bits of information (which is an additive constant
of 1.44 per symbol).
It is basically fair to say that set-versions are 20% shorter than the
equivalent bloom filters, which is not unimportant.
By the way, the information-theoretical minimum is
1024 * 10 bits of information -
you cannot go beyond that without violating fundamental laws.
__
RPM Package Managerhttp://rpm5.org
Developer Communication Listrpm-devel@rpm5.org


Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version

2012-06-20 Thread Alexey Tourbin
On Thu, Jun 21, 2012 at 12:15 AM, Alexey Tourbin
alexey.tour...@gmail.com wrote:
 On Mon, Jun 18, 2012 at 10:32 PM, Jeffrey Johnson n3...@me.com wrote:
 Good: the above confirmation of the characteristics allows a set:versions
 implementation to proceed.

 Hello, there's been some speculation about Bloom filters below, which
 I cannot address right now, offhand. Nevertheless, I can say that, in
 some highly mathematical sense, set-versions are exactly equivalent to
 Bloom filters. They do just the same thing, if you will. The only
 difference is that set-versions are more compact: they take somewhat
 less space, which was, if you remember, the number one goal of the
 original implementation.

More precisely, a set-version can be (in principle) converted to a
Bloom filter which uses only one hash function. The idea is that such
a filter will set bits in a highly sparse set of bits, one by one.
Instead, a set-version remembers the bits simply by their indices.
Setting the bit becomes taking the number, and there is a
straightforward correspondence. It also turns out that a set of
numbers can be easier to deal with, as opposed to a sparse set of
bits.

If you want to know more why things have to work like this, and e.g.
where constants pop up, there is a good starting point at Cache-,
Hash-, and Space-efficient Bloom filters paper by Felix Putze, Peter
Sanders, and Johannes Singler. Actually this paper helped me a lot to
put things together and to produce the original implementation.
Reading this paper requires some working mathematical knowledge,
though. This requirement must not be underestimated, but also should
not be overestimated. The paper is very readable: it tells you what
you may want to do and what you have to do.
http://algo2.iti.kit.edu/singler/publications/cacheefficientbloomfilters-wea2007.pdf
__
RPM Package Managerhttp://rpm5.org
Developer Communication Listrpm-devel@rpm5.org


Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version

2012-06-18 Thread Jeffrey Johnson

On Jun 15, 2012, at 11:58 PM, Alexey Tourbin wrote:

 On Fri, Jun 1, 2012 at 6:07 AM, Jeffrey Johnson n3...@me.com wrote:
 I asked 2 very specific questions … the rest is quite important also,
 but I need to understand precisely what properties set:versions have in order
 to implement correctly (and I don't fully understand your reply).
 
 Specifically:
 
1) Is the set:versions VERSION independent of the order of the
calls to rpmsetAdd()? (you know the routine as set_add())
 
 Completely independent - you can add symbols in any order. The symbols
 are then hashed and sorted by their numeric values. The underlying
 idea is that a set-version is just a (sorted) set of numbers. You can
 add whatever symbol to it, possibly twice, the symbol will be hashed
 to a number, in a unique manner, and finally you can get the string
 representation of the set of numbers. This involves much fuss under
 the hood, but basically, you should think of the set of symbols, which
 is just the set of numbers, after each symbol has been hashed
 individually.
 
2) Can the set:versions encoding be compared for more than equality?
What set/arithmetic property is the basis for the comparison? What
circumstances/constraints are there related to
… You cannot always compare
set-versions in terms of greater or equal (but when you 
 can, it's
important).
 
 Set-versions compare as sets. There are Euler diagrams to visaulise
 set comparsion, which is an undergraduate matetrial. The idea is that,
 real numbers are linear order: you can always tell either V1V2 or
 V1=V2. Sets are quite another matter: you cannot always apply for
 tertium non datur (either lt or ge). Which is to say that sets can
 be quite different and do not compare easily. The order can be imposed
 on the sets, though, by requiring the greater sets to have at least
 the same elements they compare against (perhaps I'm starting to retell
 the undergraduate material, which is not going to last). To sum up,
 there IS a mathematical basis behind Requires: foo = set:asdf
 dependencies.
 
 I can of course answer my own questions with try-and-see test cases.

Good: the above confirmation of the characteristics allows a set:versions
implementation to proceed.

What I am hearing is that set:versions is not fundamentally different from 
these transforms
Create a Bloom filter (with acceptable false positive rate)
Compress the Bloom filter parameters and bits.
Encode the compression in base{16,32,62,64} for transport.

The contained in or subset semantic that applies to the operations  and 
=
is rather easy to do as well. E.g. if (assuming on;y existence, not versioned 
inequality ranges)
P == Bloom filter of Provides: tokens
R == Bloom filter of a (possible) subset of tokens
then the subset computation is nothing more than
(P  R) == R

The important difference is that Bloom filters have a false positive failure 
probability which
is likely not present in the Golomb-Rice coding in the current implementation.

I'm expecting that Bloom+Compression+Encoding is comparable in size to the 
Golomb-Rice
encoding. The only way to tell (that I know of) is to try-and-see, which I will 
likely also do in
order to tell how good the compression is with Golomb-Rice coding.

Meanwhile there's lots of usage cases for the set:versions implementation in 
RPM, not just
as a container of Elf symbols that is a de facto ABI sub-set.

73 de Jeff

 __
 RPM Package Managerhttp://rpm5.org
 Developer Communication Listrpm-devel@rpm5.org

__
RPM Package Managerhttp://rpm5.org
Developer Communication Listrpm-devel@rpm5.org


Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version

2012-06-15 Thread Alexey Tourbin
On Fri, Jun 1, 2012 at 6:07 AM, Jeffrey Johnson n3...@me.com wrote:
 I asked 2 very specific questions … the rest is quite important also,
 but I need to understand precisely what properties set:versions have in order
 to implement correctly (and I don't fully understand your reply).

 Specifically:

        1) Is the set:versions VERSION independent of the order of the
        calls to rpmsetAdd()? (you know the routine as set_add())

Completely independent - you can add symbols in any order. The symbols
are then hashed and sorted by their numeric values. The underlying
idea is that a set-version is just a (sorted) set of numbers. You can
add whatever symbol to it, possibly twice, the symbol will be hashed
to a number, in a unique manner, and finally you can get the string
representation of the set of numbers. This involves much fuss under
the hood, but basically, you should think of the set of symbols, which
is just the set of numbers, after each symbol has been hashed
individually.

        2) Can the set:versions encoding be compared for more than equality?
        What set/arithmetic property is the basis for the comparison? What
        circumstances/constraints are there related to
                        … You cannot always compare
                set-versions in terms of greater or equal (but when you can, 
 it's
                important).

Set-versions compare as sets. There are Euler diagrams to visaulise
set comparsion, which is an undergraduate matetrial. The idea is that,
real numbers are linear order: you can always tell either V1V2 or
V1=V2. Sets are quite another matter: you cannot always apply for
tertium non datur (either lt or ge). Which is to say that sets can
be quite different and do not compare easily. The order can be imposed
on the sets, though, by requiring the greater sets to have at least
the same elements they compare against (perhaps I'm starting to retell
the undergraduate material, which is not going to last). To sum up,
there IS a mathematical basis behind Requires: foo = set:asdf
dependencies.

 I can of course answer my own questions with try-and-see test cases.
__
RPM Package Managerhttp://rpm5.org
Developer Communication Listrpm-devel@rpm5.org


Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version

2012-05-31 Thread Michael Schroeder
On Wed, May 30, 2012 at 02:56:20PM -0400, Jeffrey Johnson wrote:
 
 On Apr 23, 2012, at 10:32 AM, Jeff Johnson wrote:
 
  I should point out that writing the attached
  message (and sending from the wrong e-mail address) has instantly
  led to a different -- and perhaps more natural -- syntax like
  
  Requires: set(libfoo.so.1) = whatever
  
 
 After a month of quietly muddling, I like this namespace like syntax
 better because it doesn't lead to magic tokens like set: in the EVR
 string and is more consistent with existing namespaces in RPM dependencies.

Is that just syntax? I.e., will rpm -q --whatprovides libfoo.so.1
still find the package?

Cheers,
  Michael.

-- 
Michael Schroeder   m...@suse.de
SUSE LINUX Products GmbH,  GF Jeff Hawn, HRB 16746 AG Nuernberg
main(_){while(_=~getchar())putchar(~_-1/(~(_|32)/13*2-11)*13);}
__
RPM Package Managerhttp://rpm5.org
Developer Communication Listrpm-devel@rpm5.org


Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version

2012-05-31 Thread Jeffrey Johnson

On May 31, 2012, at 5:07 AM, Michael Schroeder wrote:

 On Wed, May 30, 2012 at 02:56:20PM -0400, Jeffrey Johnson wrote:
 
 On Apr 23, 2012, at 10:32 AM, Jeff Johnson wrote:
 
 I should point out that writing the attached
 message (and sending from the wrong e-mail address) has instantly
 led to a different -- and perhaps more natural -- syntax like
 
 Requires: set(libfoo.so.1) = whatever
 
 
 After a month of quietly muddling, I like this namespace like syntax
 better because it doesn't lead to magic tokens like set: in the EVR
 string and is more consistent with existing namespaces in RPM dependencies.
 
 Is that just syntax? I.e., will rpm -q --whatprovides libfoo.so.1
 still find the package?
 

Depends on what I choose to implement. It's quite easy to use
an additional per-namspace table, or to add two keys to the existing table, or
to treat set(…) as a probe method.

Because of the length/encoding in set:version data, --whatprovides is rather
awkward and uninformative for humans.

The namespace convention that exists in RPM is that in a statement like
Dependency: foo(N) = E:V-R
that E:V-R applies to the token N, not the wrapped name foo(N).

All other behaviors aren't specified.

73 de Jeff__
RPM Package Managerhttp://rpm5.org
Developer Communication Listrpm-devel@rpm5.org


Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version

2012-05-31 Thread Alexey Tourbin
On Thu, May 31, 2012 at 9:03 AM, Jeffrey Johnson n3...@me.com wrote:
 The mixed code case is interesting: what happens
 if a set:version encoding contains the literal string 0:V-R

 I can't understand you question. A version is either a set-version, or
 not a set-version.  If a version is a set-version, it has to be
 prefixed by set:. Regular RPM versions cannot be prefixed by set:,
 because set cannot be decoded as a valid serial number, which has to
 be an integer. There might be some implementation sloppiness out
 there, but in principle, I believe the encoding scheme is sane, and
 makes sense.

 Now the question is, what if a set-version cannot be decoded? But that
 can be perplexed by a question, what if a regular rpm version cannot
 be decoded? Or can you decode any junk as a valid RPM version?

 I'm trying to understand rpmsetcmp() as a black box independent
 of all the gory implementation details of ELF symbols, base62 encoding,
 and RPM dependencies.

 I believe that set:versions are much like Bloom filters:
        1) strings can be added to a set in any order
        2) the comparison operation implied by
                Requires: foo = set:….
        is identical to contained in or is subset of
 Is that the case?

To me, a set-version is just a VERSION. I can't stress that enough.
When you need a library, it is a legitimate question to ask which
version you need. If you answer that you need at least version 1.0, a
conventional version, you must be kidding. Because there is no
connection between what you actually need and a god-damn number. So a
plausible answer must be Well, we need at least a version which
provides foo, bar, baz symbols which we need to use. Let's take
this approach to the extreme, and say that the VERSION which we need
is simply the one which provides at least foo, bar, baz symbols. To
me, this is much better an approach to library versioning, and
possibly the only viable approach. How else can you express your
expectations about a library? Suppose someone is talking to you at a
conference, and says Mr. Johnson, we are very proud, blah-blah-blah,
because blah-blah-blah. What you want to tell them is basically Go
out, folks, it works. Now, with set-versions, things really work. :-)

Back to implementation,
1) set-strings should be considered opaque, static, and unmodifiable.
Once they are formed, there is no useful way to alter them. They
express an idea of a VERSION in a manner which cannot be easily
comprehended, but that's not a problem (or otherwise there is
perplexing questions of what regular rpm versions must mean, and
whether any piece of junk can be decoded as a regular rpm version).
2) They must compare as sets, in terms of elements unique to the first
set, common elements, and elements unique to the second set. The
implementation does not quite much yet, because it already tries to
mimic regular versions. Regular versions are linear order.
Set-versions are partial order.  You cannot always compare
set-versions in terms of greater or equal (but when you can, it's
important).
__
RPM Package Managerhttp://rpm5.org
Developer Communication Listrpm-devel@rpm5.org


Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version

2012-05-31 Thread Jeffrey Johnson

On May 31, 2012, at 7:24 PM, Alexey Tourbin wrote:

 
 
 I'm trying to understand rpmsetcmp() as a black box independent
 of all the gory implementation details of ELF symbols, base62 encoding,
 and RPM dependencies.
 
 I believe that set:versions are much like Bloom filters:
1) strings can be added to a set in any order
2) the comparison operation implied by
Requires: foo = set:….
is identical to contained in or is subset of
 Is that the case?

snip
I asked 2 very specific questions … the rest is quite important also,
but I need to understand precisely what properties set:versions have in order
to implement correctly (and I don't fully understand your reply).

Specifically:

1) Is the set:versions VERSION independent of the order of the
calls to rpmsetAdd()? (you know the routine as set_add())

2) Can the set:versions encoding be compared for more than equality?
What set/arithmetic property is the basis for the comparison? What
circumstances/constraints are there related to
… You cannot always compare
set-versions in terms of greater or equal (but when you can, 
it's
important).

I can of course answer my own questions with try-and-see test cases.

73 de Jeff__
RPM Package Managerhttp://rpm5.org
Developer Communication Listrpm-devel@rpm5.org


Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version

2012-05-30 Thread Jeffrey Johnson

On Apr 23, 2012, at 10:32 AM, Jeff Johnson wrote:

 I should point out that writing the attached
 message (and sending from the wrong e-mail address) has instantly
 led to a different -- and perhaps more natural -- syntax like
 
   Requires: set(libfoo.so.1) = whatever
 

After a month of quietly muddling, I like this namespace like syntax
better because it doesn't lead to magic tokens like set: in the EVR
string and is more consistent with existing namespaces in RPM dependencies.

I'll leave the existing set: prefix as is until I implement the set(…) syntax.

 for set:versions, and for the generalization (for writing strict regression 
 tests,
 its mostly useless in packaging because there is no mapping that specifies
 how the mixed DEB - RPM version comparison might be done naturally)
 
   Requires: deb(foo) = E:V-R
   Requires: rpm(foo) = E:V-R
 

(aside)
And this is a rather easy/natural extension to choose between alternative
version comparisons (at least for development/testing).

OTOH, its mostly useless for dependency assertion evaluation because there
is no clearly identified mixed comparison when
Provides:  deb(foo) = E:V-R
Requires: rpm(foo) = E:V-R
I suppose I can attempt a resolution like this
Prefer an explicit comparison in the Provides: if there is a conflict.
Prefer an explicit comparison over an implicit/default comparison
Use the default comparison if no explicit comparison hint is present.

 The precedent for foo(bar) name spacing in RPM dependencies with
 the above syntax is already widely deployed although entirely
 de facto.
 
 Sure would be nice _NOT_ to have to consider Have it your own way!
 competing syntaxes like
   Requires: libfoo.so.1 = set:whatever
 and
   Requires: set(libfoo.so.1) = whatever
 (and I'll ignore the gawdafulness of existing multilib dependencies like
   Requires: libfoo.ao.1()(64bit)
 coming along rapidly as a Newer! Better! Bestest! ABI standard now
 that there is support in kernel = 3.2.
 
 *shrug*
 
 Its all implementable, just not very KISS when aesthetic/innate opinions get 
 involved.
 

OK, I'm in the last stages of adding twiddle-in-version so its time to consider 
other changes:

1) Epoch as a string? The only lossage here is when digit strings 
differ in length
(and strcmp isn't identical to arithmetic comparison). Alternatively, 
should Epoch's
be forced to a [0-9]+ digit string?

2) Is there a need for
twiddle-in-epoch(assuming a string)
twiddle-in-release
twiddle-in-distepoch
There are/were several ways to implement twiddle-in-version; so far I've
chosen to do only twiddle-in-version because that involved little more
than rewriting some regex patterns. OTOH, handling twiddle-in-version
as a special case of a more general string comparison (aka rpmvercmp)
is at least as much of a problem as just permitting the full 
generalization.

3) revisiting dependency ranges with syntax like this
Provides: foo = [1.2-3|4.5-6)
which would match only Requires in the range, including 1.2-3 because 
of the '['
and excluding 4.5-6 because of the ')' with the usual mathematical 
conventions (in the US).

The problematic character would be choosing a separator (I used '|' in 
the example): almost
all the usual characters have pre-existing usages, but perhaps 
something could still be twitched together.

73 de Jeff
__
RPM Package Managerhttp://rpm5.org
Developer Communication Listrpm-devel@rpm5.org


Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version

2012-05-30 Thread Robert Xu
On Wed, May 30, 2012 at 2:56 PM, Jeff Johnson n3...@me.com wrote:

 OK, I'm in the last stages of adding twiddle-in-version so its time to 
 consider other changes:

        1) Epoch as a string? The only lossage here is when digit strings 
 differ in length
        (and strcmp isn't identical to arithmetic comparison). Alternatively, 
 should Epoch's
        be forced to a [0-9]+ digit string?

I don't see the usefulness of Epoch as a string. I would rather have
it just integers, higher takes precedence.
Much less mind boggling.


        2) Is there a need for
                twiddle-in-epoch(assuming a string)
                twiddle-in-release
                twiddle-in-distepoch
        There are/were several ways to implement twiddle-in-version; so far 
 I've
        chosen to do only twiddle-in-version because that involved little more
        than rewriting some regex patterns. OTOH, handling twiddle-in-version
        as a special case of a more general string comparison (aka rpmvercmp)
        is at least as much of a problem as just permitting the full 
 generalization.

        3) revisiting dependency ranges with syntax like this
                Provides: foo = [1.2-3|4.5-6)
        which would match only Requires in the range, including 1.2-3 because 
 of the '['
        and excluding 4.5-6 because of the ')' with the usual mathematical 
 conventions (in the US).

        The problematic character would be choosing a separator (I used '|' in 
 the example): almost
        all the usual characters have pre-existing usages, but perhaps 
 something could still be twitched together.


Can you explain twiddle-in-* to me? I'm slightly confused.
(Shows how much I haven't kept up with anything within like the
past... year .)

-- 
later daze. :: Robert Xu :: protocol.by/rxu
__
RPM Package Managerhttp://rpm5.org
Developer Communication Listrpm-devel@rpm5.org


Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version

2012-05-30 Thread Jeffrey Johnson

On May 30, 2012, at 3:09 PM, Robert Xu wrote:

 On Wed, May 30, 2012 at 2:56 PM, Jeff Johnson n3...@me.com wrote:
 
 OK, I'm in the last stages of adding twiddle-in-version so its time to 
 consider other changes:
 
1) Epoch as a string? The only lossage here is when digit strings 
 differ in length
(and strcmp isn't identical to arithmetic comparison). Alternatively, 
 should Epoch's
be forced to a [0-9]+ digit string?
 
 I don't see the usefulness of Epoch as a string. I would rather have
 it just integers, higher takes precedence.
 Much less mind boggling.
 

And Epoch: as integer requires special casing of an integer data type 
everywhere. strings
are easier for the implementation and for metadata data types.

I'd rather have a means to eliminate Epoch: … Epoch: is easily dropped
out everywhere by adding this to RPM configuration
%evr_tuple_order  VR
and all epochs will be parsed in dependency strings and ignored.

 
2) Is there a need for
twiddle-in-epoch(assuming a string)
twiddle-in-release
twiddle-in-distepoch
There are/were several ways to implement twiddle-in-version; so far 
 I've
chosen to do only twiddle-in-version because that involved little more
than rewriting some regex patterns. OTOH, handling twiddle-in-version
as a special case of a more general string comparison (aka rpmvercmp)
is at least as much of a problem as just permitting the full 
 generalization.
 
3) revisiting dependency ranges with syntax like this
Provides: foo = [1.2-3|4.5-6)
which would match only Requires in the range, including 1.2-3 because 
 of the '['
and excluding 4.5-6 because of the ')' with the usual mathematical 
 conventions (in the US).
 
The problematic character would be choosing a separator (I used '|' 
 in the example): almost
all the usual characters have pre-existing usages, but perhaps 
 something could still be twitched together.
 
 
 Can you explain twiddle-in-* to me? I'm slightly confused.

Go read the Debian documentation or ask SuSE where the need for 
twiddle-in-version is.

Perhaps the most complete description of already implemented alternatives 
@rpm5.org is here:
https://bugzilla.yoctoproject.org/show_bug.cgi?id=256

I am forced to compatibility (with dpkg in Yocto) and feature parity 
(released in rpm-4.10.0) on an
unnecessary (imho) implementation complexity adding twiddle-in-version 
@rpm5.org.

The original RFE (from SuSE) is here
http://www.rpm.org/ticket/56

 (Shows how much I haven't kept up with anything within like the
 past... year .)
 

The twiddle-in-version RFE started in 2008, was proposed in the OpenSUSE RPM 
Summit ROADMAP on
September 18th, 2009, and has just been released.

That SHOULD tell you a lot about the urgency and necessity of 
twiddle-in-version in RPM.



73 de Jeff

Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version

2012-05-30 Thread Alexey Tourbin
On Mon, Apr 23, 2012 at 6:32 PM, Jeff Johnson n3npq@gmail.com wrote:
 I should point out that writing the attached
 message (and sending from the wrong e-mail address) has instantly
 led to a different -- and perhaps more natural -- syntax like

 Requires: set(libfoo.so.1) = whatever

Hello,
Set-versions are just that - versions. One must arguably think of them
in terms of VERSIONS. If you need a library, it is a legitimate
question to ask which version you need. If you think you need the
version at least 1.0, there's a good question: why the heck you think
you need the version at least 1.0 (and whether 2.0 would still fit).
With set-versions, things get straightforward: you need at least a
version which provides foo, bar, baz API symbols - that's much
better a description of a library than a god-damn arbitrary number.

There are some philosophical implications of introducing set-versions,
in particular, whether it can be extended to describe prototypes and
calling conventions (e.g. the number of arguments which must be passed
to a function). This is why I might seem reluctant to participate in
discussions. I'm thinking! (And perhaps I'm arrogant.)

 for set:versions, and for the generalization (for writing strict regression
 tests,
 its mostly useless in packaging because there is no mapping that specifies
 how the mixed DEB - RPM version comparison might be done naturally)

 Requires: deb(foo) = E:V-R
 Requires: rpm(foo) = E:V-R

 The precedent for foo(bar) name spacing in RPM dependencies with
 the above syntax is already widely deployed although entirely
 de facto.

 Sure would be nice _NOT_ to have to consider Have it your own way!
 competing syntaxes like
 Requires: libfoo.so.1 = set:whatever
 and
 Requires: set(libfoo.so.1) = whatever
__
RPM Package Managerhttp://rpm5.org
Developer Communication Listrpm-devel@rpm5.org


Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version

2012-05-30 Thread Jeffrey Johnson

On May 30, 2012, at 9:46 PM, Alexey Tourbin alexey.tour...@gmail.com wrote:

 On Mon, Apr 23, 2012 at 6:32 PM, Jeff Johnson n3npq@gmail.com wrote:
 I should point out that writing the attached
 message (and sending from the wrong e-mail address) has instantly
 led to a different -- and perhaps more natural -- syntax like
 
 Requires: set(libfoo.so.1) = whatever
 
 Hello,
 Set-versions are just that - versions. One must arguably think of them
 in terms of VERSIONS. If you need a library, it is a legitimate
 question to ask which version you need. If you think you need the
 version at least 1.0, there's a good question: why the heck you think
 you need the version at least 1.0 (and whether 2.0 would still fit).
 With set-versions, things get straightforward: you need at least a
 version which provides foo, bar, baz API symbols - that's much
 better a description of a library than a god-damn arbitrary number.
 

Yes VERSIONS.

The issue for RPM is how to represent/attach a different VERSION comparison.

A prefix marker with set: is mostly sane (until there
are multiple meanings for ':' and '~' and '-' separators
parsing the serialized VERSION representation which is
increasingly the state of affairs @rpm5.org).

So its perhaps a more natural syntax and a cleaner
representation to use the existing name spacing
convention with the form
Requires: set(bar) = VERSION
to distinguish in a namespace the VERSION that needs a
different comparison method attached.

But these are mostly details (until one has to write
the parsing regex and watch the combinatorial explosion
of possibilities with optional fields).

But I'm biased at the moment, having just today added
twiddle-in-version ...

 There are some philosophical implications of introducing set-versions,
 in particular, whether it can be extended to describe prototypes and
 calling conventions (e.g. the number of arguments which must be passed
 to a function). This is why I might seem reluctant to participate in
 discussions. I'm thinking! (And perhaps I'm arrogant.)
 

Thinking is good and arrogance matters little (to me ;-): set:versions
is the right thing to do, collapsing multiple redundant
just in case assertions into a single dependency representation.

hth

73 de jeff

 for set:versions, and for the generalization (for writing strict regression
 tests,
 its mostly useless in packaging because there is no mapping that specifies
 how the mixed DEB - RPM version comparison might be done naturally)
 
 Requires: deb(foo) = E:V-R
 Requires: rpm(foo) = E:V-R
 
 The precedent for foo(bar) name spacing in RPM dependencies with
 the above syntax is already widely deployed although entirely
 de facto.
 
 Sure would be nice _NOT_ to have to consider Have it your own way!
 competing syntaxes like
 Requires: libfoo.so.1 = set:whatever
 and
 Requires: set(libfoo.so.1) = whatever
 __
 RPM Package Managerhttp://rpm5.org
 Developer Communication Listrpm-devel@rpm5.org

__
RPM Package Managerhttp://rpm5.org
Developer Communication Listrpm-devel@rpm5.org


Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version

2012-05-30 Thread Alexey Tourbin
On Thu, May 31, 2012 at 6:51 AM, Jeffrey Johnson n3...@me.com wrote:
 We are in violent agreement here over a minor issue
 of implementation/representation.

By the way, actual problems that will arise are rarely what you expect
them to be. In 2010, I was naive and I thought that char bitv[] was
a pretty good representation of bit sequence (which can be still seen
in set.c). It then took many days to devise a sophisticated decoding
routine which avoids bitv[] altogether and makes things smooth. So, in
a violent agreement, don't take things for granted. :-)

 The mixed code case is interesting: what happens
 if a set:version encoding contains the literal string 0:V-R

I can't understand you question. A version is either a set-version, or
not a set-version.  If a version is a set-version, it has to be
prefixed by set:. Regular RPM versions cannot be prefixed by set:,
because set cannot be decoded as a valid serial number, which has to
be an integer. There might be some implementation sloppiness out
there, but in principle, I believe the encoding scheme is sane, and
makes sense.

Now the question is, what if a set-version cannot be decoded? But that
can be perplexed by a question, what if a regular rpm version cannot
be decoded? Or can you decode any junk as a valid RPM version?

 and a match is attempted against a traditional dependency like
        Provides: foo = 0:V-R
 If the literal string in the Provides: is encoded on the
 fly, will setcmp(…) match or not?
__
RPM Package Managerhttp://rpm5.org
Developer Communication Listrpm-devel@rpm5.org


Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version

2012-05-30 Thread Jeffrey Johnson

On May 30, 2012, at 11:24 PM, Alexey Tourbin alexey.tour...@gmail.com wrote:

 On Thu, May 31, 2012 at 6:51 AM, Jeffrey Johnson n3...@me.com wrote:
 We are in violent agreement here over a minor issue
 of implementation/representation.
 
 By the way, actual problems that will arise are rarely what you expect
 them to be. In 2010, I was naive and I thought that char bitv[] was
 a pretty good representation of bit sequence (which can be still seen
 in set.c). It then took many days to devise a sophisticated decoding
 routine which avoids bitv[] altogether and makes things smooth. So, in
 a violent agreement, don't take things for granted. :-)
 

In 2001, I thought that typedef'd enum's were the best
implementation for named bit fields because the names
were displayed in gdb backtraces. There is some speshul
painfulness not only from the typedef's in both C/C++
but also from the inability to extend the no. of bits
beyond whatever and int contains.

But these are implementation issues which I choose to
call minor ;-)

 The mixed code case is interesting: what happens
 if a set:version encoding contains the literal string 0:V-R
 
 I can't understand you question. A version is either a set-version, or
 not a set-version.  If a version is a set-version, it has to be
 prefixed by set:. Regular RPM versions cannot be prefixed by set:,
 because set cannot be decoded as a valid serial number, which has to
 be an integer. There might be some implementation sloppiness out
 there, but in principle, I believe the encoding scheme is sane, and
 makes sense.
 
 Now the question is, what if a set-version cannot be decoded? But that
 can be perplexed by a question, what if a regular rpm version cannot
 be decoded? Or can you decode any junk as a valid RPM version?
 

I'm trying to understand rpmsetcmp() as a black box independent
of all the gory implementation details of ELF symbols, base62 encoding,
and RPM dependencies.

I believe that set:versions are much like Bloom filters:
1) strings can be added to a set in any order
2) the comparison operation implied by
Requires: foo = set:….
is identical to contained in or is subset of
Is that the case?

73 de Jeff
 and a match is attempted against a traditional dependency like
Provides: foo = 0:V-R
 If the literal string in the Provides: is encoded on the
 fly, will setcmp(…) match or not?
 __
 RPM Package Managerhttp://rpm5.org
 Developer Communication Listrpm-devel@rpm5.org

__
RPM Package Managerhttp://rpm5.org
Developer Communication Listrpm-devel@rpm5.org