On Fri, Apr 20, 2012 at 4:12 PM, R P Herrold <herr...@owlriver.com> wrote: > On Fri, 20 Apr 2012, Alexey Tourbin wrote: > >> I have just learnt that rpm5 project has borrowed set-string >> implementation recently from ALT Linux. At the very same time, I was >> working on on a new and improved encoding scheme which can make > > ... This is exciting news -- This seems like it would be a useful library. > What package in ALT are you doing this work in, or alternatiely, is there a > version control repository that I could check out to read your ongoing work?
The goal is not only to improve the implementation, but also to refine basic concepts and designs. Actually, set.c is already usable as a library on its own. Why, it provides an API for creating set-versions, and it also implements rpmsetcmp() comparison routine. That's just enough to get the job done, and everything else then is the implementation details which it hides in a particularly perfect manner. On the other hand, the details are concealed perhaps a bit too much - up to the point where it is not clear what set-versions are and how exactly they are supposed to work. It is desirable then to expose a lower-level API which clarifies the concept of set-versions while still hiding less important implementation details. So what's a set-version? Intuitively, a set-version represents a set of symbols. This actually suggests a new "de facto" approach to library versioning: the required version of a library, in addition to its soname, is simply the set of library symbols (i.e. functions and global variables) which we need to use. Likewise, the version provided by a library is simply the set of all symbols exported by the library. The key point is that Requires versions and Provides versions can be produced in a relatively independent manner, and meaningfully compared at later stages; that is, it is possible to check if R \subset P. The check is probabilistic, which indicates a possibility of error. The error rate is reasonably small, though, and can be further controlled by a parameter. What's more important is that only "false positive" kind of error is possible - that is, in the worst case, the check simply does not work, but at least we lose nothing. Another kind of error, a "false alarm", is not possible. In this respect, set-versions are similar to Bloom filters. Set-versions have other nice properties which you might suspect, and the one which is not so nice: the length of n-element set-version is O(n). This is a fundamental limit which cannot be overcome. However, a practical and feasible implementation is still possible. This outlines two implementation priorities: 1) set-versions should be as short in size as possible - actually their size should be close to the information-theoretical minimum; 2) however, this must not tamper with the possibility of fast decoding and comparison. With current implementation, when the error rate is fixed at about 0.1%, Provides version take about 2 alphanumeric characters per symbol, and Requires versions, since they are much more sparse, can take up to 3 alphanumeric characters per symbol. For a repo of about 10,000 packages, Requires and Provides set-versions can take only about 10M total (but this assumes that some dependency optimizations are performed and also that superfluous plugin-like Provides are excluded). The check of all set-versioned dependencies, such as performed by "apt-cache unmet", can be finished within a second. To sum up, this is a compromise; but it is a favorable compromise. So what a set-version really is? What if we say that a set-version is just a set of numbers, such as hash values obtained after hashing each symbol individually? Simple as it is, this approach can be used to express everything else. More precisely, we need a scheme to encode n m-bit numbers. For some reason which will become apparent later, we need to supply m, which is actual bits per hash value, aka bpp, explicitly. So I think we can define a lower-level "set-string" encoding API as follows: /** \ingroup setstring * Estimate the size of a string buffer for encoding. * @param v the values, sorted and unique * @param n number of values * @param bpp actual bits per value, 8..32 * @return buffer size for encoding, < 0 on error */ int setstringEncodeSize(const unsigned *v, int n, int bpp); /** \ingroup setstring * Encode a set of numeric values into alnum string. * @param v the values, sorted and unique * @param n number of values * @param bpp actual bits per value, 8..32 * @param s alnum output, null-terminated on success * @return alnum string length, < 0 on error */ int setstringEncode(const unsigned *v, int n, int bpp, char *s); The decoding API mirrors the encoding routines: /** \ingroup setstring * Estimate the number of values obtained after decoding a set-string. * @param s alnum string to decode, null-terminated * @param len alnum string length * @return number of values (upper size), < 0 on error */ int setstringDecodeSize(const char *s, int len); /** \ingroup setstring * Decode alnum string into a set of numeric values. * @param s alnum string to decode, null-terminated * @param len alnum string length * @v decoded values (sorted and unique) * @pbpp original bits per value * @return number of values, < 0 on error */ int setstringDecode(const char *s, int len, unsigned *v, int *pbpp); The first things to notice is that these routines only do bit crunching, and they don't not try to do something else which can be even remotely inefficient. For example, they do not allocate storage on their own; instead, they facilitate allocation by the caller. This is because many Requires versions are small, and can be decoded on the stack, thus saving a malloc() call. If malloc is cheap, this might turn out to be an unnecessary complication. The decoding routine won't even call strlen(). This is because average Provides version length is about 2K, and in certain scenarios the length can be reused. The are also two important things to notice in this overall design: 1) the error rate can be controlled by choosing an appropriate bpp parameter; 2) it is even possible to compare two set-versions with different bpp parameter, provided that the same hash function, modulo power of two, was used: the set v[] with higher bpp value can be "downsampled" by stripping higher bits and sorting the values again. The R \subset P check is itself straightforward. ______________________________________________________________________ RPM Package Manager http://rpm5.org Developer Communication List rpm-devel@rpm5.org