Re: refined implementation of set-versions

Alexey Tourbin Fri, 20 Apr 2012 22:53:01 -0700

On Fri, Apr 20, 2012 at 4:12 PM, R P Herrold <[email protected]> wrote:
> On Fri, 20 Apr 2012, Alexey Tourbin wrote:
>
>> I have just learnt that rpm5 project has borrowed set-string
>> implementation recently from ALT Linux. At the very same time, I was
>> working on on a new and improved encoding scheme which can make
>
> ... This is exciting news -- This seems like it would be a useful library.
>  What package in ALT are you doing this work in, or alternatiely, is there a
> version control repository that I could check out to read your ongoing work?


The goal is not only to improve the implementation, but also to refine
basic concepts and designs.  Actually, set.c is already usable as a
library on its own. Why, it provides an API for creating set-versions,
and it also implements rpmsetcmp() comparison routine. That's just
enough to get the job done, and everything else then is the
implementation details which it hides in a particularly perfect
manner. On the other hand, the details are concealed perhaps a bit too
much - up to the point where it is not clear what set-versions are and
how exactly they are supposed to work. It is desirable then to expose
a lower-level API which clarifies the concept of set-versions while
still hiding less important implementation details.

So what's a set-version?

Intuitively, a set-version represents a set of symbols. This actually
suggests a new "de facto" approach to library versioning: the required
version of a library, in addition to its soname, is simply the set of
library symbols (i.e. functions and global variables) which we need to
use. Likewise, the version provided by a library is simply the set of
all symbols exported by the library. The key point is that Requires
versions and Provides versions can be produced in a relatively
independent manner, and meaningfully compared at later stages; that
is, it is possible to check if R \subset P.  The check is
probabilistic, which indicates a possibility of error.  The error rate
is reasonably small, though, and can be further controlled by a
parameter. What's more important is that only "false positive" kind of
error is possible - that is, in the worst case, the check simply does
not work, but at least we lose nothing. Another kind of error, a
"false alarm", is not possible. In this respect, set-versions are
similar to Bloom filters.

Set-versions have other nice properties which you might suspect, and
the one which is not so nice: the length of n-element set-version is
O(n). This is a fundamental limit which cannot be overcome. However, a
practical and feasible implementation is still possible. This outlines
two implementation priorities: 1) set-versions should be as short in
size as possible - actually their size should be close to the
information-theoretical minimum; 2) however, this must not tamper with
the possibility of fast decoding and comparison. With current
implementation, when the error rate is fixed at about 0.1%, Provides
version take about 2 alphanumeric characters per symbol, and Requires
versions, since they are much more sparse, can take up to 3
alphanumeric characters per symbol. For a repo of about 10,000
packages, Requires and Provides set-versions can take only about 10M
total (but this assumes that some dependency optimizations are
performed and also that superfluous plugin-like Provides are
excluded). The check of all set-versioned dependencies, such as
performed by "apt-cache unmet", can be finished within a second. To
sum up, this is a compromise; but it is a favorable compromise.

So what a set-version really is?  What if we say that a set-version is
just a set of numbers, such as hash values obtained after hashing each
symbol individually? Simple as it is, this approach can be used to
express everything else. More precisely, we need a scheme to encode n
m-bit numbers. For some reason which will become apparent later, we
need to supply m, which is actual bits per hash value, aka bpp,
explicitly.

So I think we can define a lower-level "set-string" encoding API as follows:

/** \ingroup setstring
 * Estimate the size of a string buffer for encoding.
 * @param v             the values, sorted and unique
 * @param n             number of values
 * @param bpp           actual bits per value, 8..32
 * @return              buffer size for encoding, < 0 on error
 */
int setstringEncodeSize(const unsigned *v, int n, int bpp);

/** \ingroup setstring
 * Encode a set of numeric values into alnum string.
 * @param v             the values, sorted and unique
 * @param n             number of values
 * @param bpp           actual bits per value, 8..32
 * @param s             alnum output, null-terminated on success
 * @return              alnum string length, < 0 on error
 */
int setstringEncode(const unsigned *v, int n, int bpp, char *s);

The decoding API mirrors the encoding routines:

/** \ingroup setstring
 * Estimate the number of values obtained after decoding a set-string.
 * @param s             alnum string to decode, null-terminated
 * @param len           alnum string length
 * @return              number of values (upper size), < 0 on error
 */
int setstringDecodeSize(const char *s, int len);

/** \ingroup setstring
 * Decode alnum string into a set of numeric values.
 * @param s             alnum string to decode, null-terminated
 * @param len           alnum string length
 * @v                   decoded values (sorted and unique)
 * @pbpp                original bits per value
 * @return              number of values, < 0 on error
 */
int setstringDecode(const char *s, int len, unsigned *v, int *pbpp);

The first things to notice is that these routines only do bit
crunching, and they don't not try to do something else which can be
even remotely inefficient. For example, they do not allocate storage
on their own; instead, they facilitate allocation by the caller. This
is because many Requires versions are small, and can be decoded on the
stack, thus saving a malloc() call. If malloc is cheap, this might
turn out to be an unnecessary complication. The decoding routine won't
even call strlen(). This is because average Provides version length is
about 2K, and in certain scenarios the length can be reused.

The are also two important things to notice in this overall design: 1)
the error rate can be controlled by choosing an appropriate bpp
parameter; 2) it is even possible to compare two set-versions with
different bpp parameter, provided that the same hash function, modulo
power of two, was used: the set v[] with higher bpp value can be
"downsampled" by stripping higher bits and sorting the values again.
The R \subset P check is itself straightforward.
______________________________________________________________________
RPM Package Manager                                    http://rpm5.org
Developer Communication List                        [email protected]

Re: refined implementation of set-versions

Reply via email to