Jeff Johnson wrote:
>> I was recently looking at making a "manifest" for FreeBSD,
>> which consists of a simple files listing for *each package*.
>>
>> ftp://ftp.freebsd.org/pub/FreeBSD/ports/amd64/packages-8.1-release/All/*.tbz
>>
>> I was looking at the Slackware MANIFEST as a reference, which
>> is just a tarball listing of each file prefixed with a header:
>>
>> ftp://ftp.slackware.com/pub/slackware/slackware64-13.1/slackware64/MANIFEST.bz2
>>
>
> Just splitting out a BaseURL, starts to save bytes in the manifest
> with no loss in generality other than each item in manifest is scoped within
> a BaseURL somehow.
The main difference was that the Slackware header was more "decorated",
and that the tar listing used the verbose format. If generating from the
(src) ports, then the actual files aren't known but only the file names
so going with the less verbose:

++========================================
||
||   Package:  ./a/aaa_base-13.1-i486-2.txz
||
++========================================
drwxr-xr-x root/root         0 2010-05-13 13:51 ./
drwxr-xr-x root/root         0 2010-05-16 13:04 etc/
-rw-r--r-- root/root        17 2010-05-16 13:04 etc/slackware-version

# games/0verkill
bin/0verkill
bin/0verkill-avi
bin/0verkill-bot

The ports only stores files, not directories, due to the use of mtree.
The other packages needed a 'f' versus 'd' flag too. No need to store
the prefix either, since all files are in it. (e.g. the ports are using
/usr/local/etc, rather than /etc)

>> I could only come up with encoding each directory separately,
>> not anything more clever like the techniques mentioned above.
>>
>> # portinfo
>> dirname1/basename1
>> dirname1/basename2
>> dirname2/basename3
>>
>> =>
>>
>> portinfo|dirname1/basename1:basename2|dirname2/basename3|
>>
>> Should be able to do a better "portsearch" armed with that.
>> There's one available in C, but I wanted a Ruby version...
>> (just so that it fits into the existing pkgtools framework)
>> Existing one at http://people.freebsd.org/~vd/portsearch/
>>
>
> There may be other stores if/when searches/patterns become important.
>
> If all that's needed is shorter and simple 1-pass recreation, the Woods
> (and similar prefix stores) are good. Memoizing like in *.solv is
> good for assembling hash lookups.

Reusing the dirnames and just storing the basenames does give a nice
space saving, but bandwidth-wise it's similar (since the compression
removes most of the redundancy...) Nobody uses LZMA/XZ, but I suppose
it would do even better.
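
For illustration, the per-directory encoding quoted above
(portinfo|dirname1/basename1:basename2|dirname2/basename3|) could be
sketched roughly as below. This is only a minimal sketch in Python, not
the actual pkgtools or portsearch code: the function names encode_port
and decode_port are made up here, and it assumes that paths are relative
to the prefix and that no file name contains '|' or ':'.

```python
import posixpath


def encode_port(name, paths):
    """Encode one port as 'name|dir/base1:base2|dir2/base3|'.

    Assumes (hypothetically) prefix-relative paths and file names
    that never contain '|' or ':'.
    """
    groups = {}
    for path in sorted(paths):
        dirname, basename = posixpath.split(path)
        groups.setdefault(dirname, []).append(basename)
    fields = [name]
    for dirname in sorted(groups):
        # Files directly in the prefix have an empty dirname.
        prefix = dirname + "/" if dirname else ""
        fields.append(prefix + ":".join(groups[dirname]))
    return "|".join(fields) + "|"


def decode_port(record):
    """Inverse of encode_port: return (name, list of paths)."""
    name, *groups = record.rstrip("|").split("|")
    paths = []
    for group in groups:
        # Everything before the last '/' is the shared dirname.
        dirname, _, basenames = group.rpartition("/")
        for basename in basenames.split(":"):
            paths.append(posixpath.join(dirname, basename) if dirname else basename)
    return name, paths


if __name__ == "__main__":
    files = ["dirname1/basename1", "dirname1/basename2", "dirname2/basename3"]
    record = encode_port("portinfo", files)
    print(record)  # portinfo|dirname1/basename1:basename2|dirname2/basename3|
    print(decode_port(record))
```

Since both the dirnames and the basenames come out sorted, a record like
this stays greppable by package and can be expanded back to full paths
in a single pass.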

131M    MANIFEST
 11M    MANIFEST.bz2
 52M    PLIST
9,3M    PLIST.bz2

This is similar to the Fedora results with "filelists":
(the SQLite version uses something similar to dirindex)

219M    filelists.xml
 17M    9898b192b32412fbbc2645ef28295b9012aea1bead0f8f91e19351526f5edb15-filelists.xml.gz

CREATE TABLE packages (  pkgKey INTEGER PRIMARY KEY,  pkgId TEXT);
CREATE TABLE filelist (  pkgKey INTEGER,  dirname TEXT,  filenames TEXT,  filetypes TEXT);

 99M    filelists.sqlite
 19M    33f9c1696ff6bcfe3efd7fda9fadcc1ea6844c6157783ff2d28709d59c01c406-filelists.sqlite.bz2

So the actual format doesn't matter all that much, really.
What would have mattered more was to avoid downloading 40G...
(i.e. having to get each and every .tbz package file, just
because there is no manifest on what each of them contains)

Even if repeating the package on each line for greppability,
it doesn't affect the size (i.e. after compression) by much.
After unpacking it's enormous, and also slower due to later
having to feed all that redundant data through the I/O....

Like the Ubuntu format with "Contents", for instance:
(as used by the apt-file tool, or by grep directly)

256M    Contents-amd64
 17M    Contents-amd64.gz

The other approaches either used package "headers" before
starting a new list, or used package indexes. That does
make retrieval (slightly) more complicated, since one
must do an extra step to look up package...

>>> The point is that there's nothing very useful or sane about files.xml*
>>> as currently (and naively) encoded in XML. Just a teensy amount
>>> of thought saves far more bandwidth than any amount of spewage/formatting
>>> discussion.
>>
>>
>> The files.xml seems to be mostly ASCII text (in contents),
>> so seems to be on par with the Slackware MANIFEST offering...
>> And the filelists.xml is too, only it is using more markup...
>> Since it does a <file>path</file> tag wrap per line, as well.
>>
>
> Yes.
> At the time repo-md was designed, the goal was to split
> dependencies from files and thereby reduce the download size
> and server load. So no one cared what was in files.xml.
>
> That goal has not been achieved, and there are other savings,
> like HTTP 1.1 persistence possible these days that were not
> widely deployed when repo-md was devised.
>
> The other goal for repo-md was standard XML markup. Which
> would make perfect sense if anything other than depsolvers
> used the markup. But yum has moved to sqlite, zypper to *.solv,
> smart has its pickle, and apt-rpm its cache, none of which
> really wish XML as the primary/important markup.
>
> The point being is that files.* could be coded up almost
> any way one wished, and regenerated as an XML stream if
> there was _REALLY_ an interest, and nothing much would
> really break. Yes an additional conversion might have to
> be introduced, just it's not a hard/slow conversion because
> it can be done in a single pass without libraries and parsers
> and ... the Woods store can be coded up pretty quickly in
> any language, not just C.

I think the "dirindex" optimization will be enough for my metadata,
and do the other Incremental Encoding locally ? That should also
allow for path globbing, which is a common use case.
Both dirnames/basenames are sorted, for bsearch.

--anders

PS. And then on to encoding the icons in XML as Base64... :-)
    (the regular solution would tar them up on the side,
     and I suppose any SQLite implementation would use a BLOB)
    Good thing it only wants the 48x48, and not the 512x512 ?
______________________________________________________________________
RPM Package Manager                                    http://rpm5.org
Developer Communication List                        [email protected]
