Jeff Johnson wrote:

>> I was recently looking at making a "manifest" for FreeBSD,
>> which consists of a simple files listing for *each package*.
>> 
>> ftp://ftp.freebsd.org/pub/FreeBSD/ports/amd64/packages-8.1-release/All/*.tbz
>> 
>> I was looking at the Slackware MANIFEST as a reference, which
>> is just a tarball listing of each file prefixed with a header:
>> 
>> ftp://ftp.slackware.com/pub/slackware/slackware64-13.1/slackware64/MANIFEST.bz2
>> 
> 
> Just splitting out a BaseURL starts to save bytes in the manifest,
> with no loss in generality other than that each item in the manifest is
> scoped within a BaseURL somehow.

The main difference was that the Slackware header was more
"decorated", and that the tar listing used the verbose format.
When generating from the (src) ports, only the file names are known,
not the actual files, so I went with the less verbose format
(Slackware listing first, then the ports listing):

++========================================
||
||   Package:  ./a/aaa_base-13.1-i486-2.txz
||
++========================================
drwxr-xr-x root/root         0 2010-05-13 13:51 ./
drwxr-xr-x root/root         0 2010-05-16 13:04 etc/
-rw-r--r-- root/root        17 2010-05-16 13:04 etc/slackware-version

# games/0verkill
bin/0verkill
bin/0verkill-avi
bin/0verkill-bot

The ports only store files, not directories, due to the use
of mtree. The other packages needed an 'f' versus 'd' flag too.
No need to store the prefix either, since all files are in it.
(e.g. the ports are using /usr/local/etc, rather than /etc)

>> I could only come up with encoding each directory separately,
>> not anything more clever like the techniques mentioned above.
>> 
>> # portinfo
>> dirname1/basename1
>> dirname1/basename2
>> dirname2/basename3
>> 
>> =>
>> 
>> portinfo|dirname1/basename1:basename2|dirname2/basename3|
>> 
>> Should be able to do a better "portsearch" armed with that.
>> There's one available in C, but I wanted a Ruby version...
>> (just so that it fits into the existing pkgtools framework)
>> Existing one at http://people.freebsd.org/~vd/portsearch/
>> 
> 
> There may be other stores if/when searches/patterns become important.
> 
> If all that's needed is shorter and simple 1-pass recreation, the Woods
> (and similar prefix stores) are good. Memoizing like in *.solv is
> good for assembling hash lookups.
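
For reference, the per-directory encoding quoted above can be sketched
roughly like this. It is only an illustration in Python (not the Ruby
version I have in mind); the function names are made up, and it assumes
that paths are relative to the prefix and never contain '|' or ':'.

```python
import posixpath
from itertools import groupby

def encode_port(name, paths):
    """Pack one port's file list into 'name|dir/base1:base2|...|'."""
    fields = [name]
    # Sorting keeps files of the same directory adjacent for groupby.
    split = sorted(posixpath.split(p) for p in paths)
    for dirname, group in groupby(split, key=lambda pair: pair[0]):
        basenames = ":".join(base for _, base in group)
        fields.append(posixpath.join(dirname, basenames))
    return "|".join(fields) + "|"

def decode_port(line):
    """Inverse of encode_port: return (name, list of paths)."""
    name, *groups = line.rstrip("|").split("|")
    paths = []
    for group in groups:
        # The last '/' separates the dirname from the ':'-joined basenames.
        dirname, _, basenames = group.rpartition("/")
        paths.extend(posixpath.join(dirname, b) for b in basenames.split(":"))
    return name, paths

files = ["dirname1/basename1", "dirname1/basename2", "dirname2/basename3"]
line = encode_port("portinfo", files)
print(line)
print(decode_port(line))
```

Each dirname is stored once per port, so the saving grows with the
number of files that share a directory.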

Reusing the dirnames and just storing the basenames does
give a nice space saving, but bandwidth-wise it's similar
(since the compression removes most of the redundancy...)
Nobody uses LZMA/XZ, but I suppose it would do even better.

131M    MANIFEST
 11M    MANIFEST.bz2

 52M    PLIST
9,3M    PLIST.bz2

This is similar to the Fedora results with "filelists":
(the SQLite version uses something similar to dirindex)

219M    filelists.xml
 17M    9898b192b32412fbbc2645ef28295b9012aea1bead0f8f91e19351526f5edb15-filelists.xml.gz

CREATE TABLE packages (  pkgKey INTEGER PRIMARY KEY,  pkgId TEXT);
CREATE TABLE filelist (  pkgKey INTEGER,  dirname TEXT,  filenames TEXT,  filetypes TEXT);

 99M    filelists.sqlite
 19M    33f9c1696ff6bcfe3efd7fda9fadcc1ea6844c6157783ff2d28709d59c01c406-filelists.sqlite.bz2
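
A lookup against that schema might look something like the sketch below.
It assumes that the "filenames" column holds the basenames of one
directory joined with '/' and that "filetypes" holds one character per
name; that is how I remember the format, so treat it as an assumption.
The sample row and the find_packages() helper are invented for the example.

```python
import sqlite3

def find_packages(db, path):
    """Return the pkgIds owning `path`, assuming '/'-joined basenames per row."""
    dirname, _, basename = path.rpartition("/")
    rows = db.execute(
        "SELECT p.pkgId, f.filenames FROM filelist f "
        "JOIN packages p ON p.pkgKey = f.pkgKey WHERE f.dirname = ?",
        (dirname or "/",),
    )
    return [pkg for pkg, names in rows if basename in names.split("/")]

# In-memory database with the two tables from the schema above.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE packages (pkgKey INTEGER PRIMARY KEY, pkgId TEXT)")
db.execute("CREATE TABLE filelist (pkgKey INTEGER, dirname TEXT, "
           "filenames TEXT, filetypes TEXT)")
# Made-up sample data: one package with two files in /usr/bin.
db.execute("INSERT INTO packages VALUES (1, 'example-pkgid')")
db.execute("INSERT INTO filelist VALUES (1, '/usr/bin', 'foo/bar', 'ff')")

print(find_packages(db, "/usr/bin/bar"))
```

Only the rows for one directory are fetched, and the basenames are
split in the client.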

So the actual format doesn't matter all that much, really.
What would have mattered more was to avoid downloading 40G...
(i.e. having to get each and every .tbz package file, just
because there is no manifest on what each of them contains)

Even if repeating the package on each line for greppability,
it doesn't affect the size (i.e. after compression) by much.
After unpacking it's enormous, and also slower due to later
having to feed all that redundant data through the I/O....

Like the Ubuntu format with "Contents", for instance:
(as used by the apt-file tool, or by grep directly)

256M    Contents-amd64
 17M    Contents-amd64.gz

The other approaches either used package "headers"
before starting a new list, or used package indexes.

That does make retrieval (slightly) more complicated,
since one must do an extra step to look up package...
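
The extra step for the "header" style is small, though. A minimal sketch,
assuming the ports listing format shown earlier (a "# origin" line
followed by one path per line); grep_plist() is just a name picked for
the example, and the second port in the sample data is made up:

```python
def grep_plist(lines, needle):
    """Yield (port, path) for every path that contains `needle`."""
    port = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("# "):
            port = line[2:]      # a header line starts a new port
        elif line and needle in line:
            yield port, line

listing = [
    "# games/0verkill",
    "bin/0verkill",
    "bin/0verkill-avi",
    "bin/0verkill-bot",
    "",
    "# hypothetical/other",
    "bin/other",
]
for port, path in grep_plist(listing, "avi"):
    print(port, path)
```

It is still a single pass over the file; the only state to keep is
the most recent header.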

>>> The point is that there's nothing very useful or sane about files.xml*
>>> as currently (and naively) encoded in XML. Just a teensy amount
>>> of thought saves far more bandwidth than any amount of spewage/formatting
>>> discussion.
>> 
>> 
>> The files.xml seems to be mostly ASCII text (in contents),
>> so seems to be on par with the Slackware MANIFEST offering...
>> And the filelists.xml is too, only it is using more markup...
>> Since it does a <file>path</file> tag wrap per line, as well.
>> 
> 
> Yes. At the time repo-md was designed, the goal was to split
> dependencies from files and thereby reduce the download size
> and server load. So no one cared what was in files.xml.
> 
> That goal has not been achieved, and there are other savings,
> like HTTP 1.1 persistence possible these days that were not
> widely deployed when repo-md was devised.
> 
> The other goal for repo-md was standard XML markup. Which
> would make perfect sense if anything other than depsolvers
> used the markup. But yum has moved to sqlite, zypper to *.solv,
> smart has its pickle, and apt-rpm its cache, none of which
> really wish XML as the primary/important markup.
> 
> The point is that files.* could be coded up almost
> any way one wished, and regenerated as an XML stream if
> there was _REALLY_ an interest, and nothing much would
> really break. Yes, an additional conversion might have to
> be introduced, but it's not a hard/slow conversion because
> it can be done in a single pass without libraries and parsers
> and ... the Woods store can be coded up pretty quickly in
> any language, not just C.

I think the "dirindex" optimization will be enough for my
metadata, and I can do the other Incremental Encoding locally?
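
By Incremental Encoding I mean the usual front compression of a sorted
list, where each entry stores only the number of characters shared with
the previous entry plus the remaining suffix. A rough sketch (the
function names are my own, and this is not necessarily what any of the
existing stores do):

```python
import os

def front_encode(sorted_paths):
    """Encode each path as (shared prefix length, remaining suffix)."""
    prev = ""
    out = []
    for path in sorted_paths:
        # commonprefix compares character by character, not per component.
        shared = len(os.path.commonprefix([prev, path]))
        out.append((shared, path[shared:]))
        prev = path
    return out

def front_decode(pairs):
    """Rebuild the original paths in a single pass."""
    prev = ""
    paths = []
    for shared, suffix in pairs:
        prev = prev[:shared] + suffix
        paths.append(prev)
    return paths

paths = ["bin/0verkill", "bin/0verkill-avi", "bin/0verkill-bot"]
encoded = front_encode(paths)
print(encoded)
print(front_decode(encoded) == paths)
```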

That should also allow for path globbing, which is a common
use case. Both dirnames/basenames are sorted, for bsearch.
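
Roughly what I have in mind for the lookup, as a sketch only: two
parallel lists, sorted dirnames and the sorted basenames for each of
them. It assumes the directory part of the pattern is literal and only
the basename carries wildcards; the names and sample data are invented.

```python
import posixpath
from bisect import bisect_left
from fnmatch import fnmatchcase

def find_dir(dirs, dirname):
    """Binary search in the sorted dirname list; return index or -1."""
    i = bisect_left(dirs, dirname)
    return i if i < len(dirs) and dirs[i] == dirname else -1

def glob_in_dir(dirs, basenames, pattern):
    """Match 'literal/dir/glob*' against the index."""
    dirname, _, base_pattern = pattern.rpartition("/")
    i = find_dir(dirs, dirname)
    if i < 0:
        return []
    return [posixpath.join(dirname, name)
            for name in basenames[i] if fnmatchcase(name, base_pattern)]

# Made-up index: dirs[i] owns basenames[i]; both levels are sorted.
dirs = ["bin", "share/0verkill"]
basenames = [["0verkill", "0verkill-avi", "0verkill-bot"], ["README"]]

print(glob_in_dir(dirs, basenames, "bin/0verkill-*"))
```

Wildcards in the directory part would need a scan over the dirnames
first, but that list is much shorter than the full file list.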

--anders

PS.
And then on to encoding the icons in XML as Base64... :-)
(the regular solution would be to tar them up on the side,
and I suppose any SQLite implementation would use a BLOB)
Good thing it only wants the 48x48, and not the 512x512 ?

______________________________________________________________________
RPM Package Manager                                    http://rpm5.org
Developer Communication List                        [email protected]
