On Dec 14, 2010, at 6:46 PM, Anders F Björklund wrote:

> Jeff Johnson wrote:
> 
>> There are some very simple data reductions on hierarchical
>> paths too. One of the best known is
>>      Run a dictionary: assign an integer weighted by # of
>>      occurences to favor small integers for frequently
>>      encountered tokens between /.../ (all of "usr" and "bin"
>>      and "lib" and "var" will have small integers assigned).
>> 
>> This was used by *BSD find/locate to reduce the size needed
>> for all possible paths to something much smaller that could be
>> read/regenerated easily (KEY:VAL dictionary up front, integer
>> substituted strings with lookup for single pass regenerate).
>> 
>> Somebody Woods devised the *BSD path locate encoding in 1984 iirc.
> 
> Google said http://techreports.lib.berkeley.edu/accessPages/CSD-83-148.html
> 
> Finding Files Fast
> Authors:  Woods, James A.
> Technical Report Identifier: CSD-83-148
> January 15, 1983 
> 

Bingo. Off by a year, and the chloroxed neurons resisted
confusion with the other James Wood hash bucket collision. Good.

Note that essentially the same dictionary encoding is/was in use
in PC-DOS and MS-DOS for error messages to save expensive bits
in BIOS ROM's.

> Interesting reading, should have done this homework earlier...
> 

There's better and more clever, its just a prefix store, and
so a crit-bit (djb lingo) or Patricia could also be done,
but not quite as simply as the Woods store is. Trust me: PDP/Vax multiuser
uglix systems really needed KISS, not XML.

>> see whatever linux is doing (likely not too much/better) with
>> find/locate for data reduction and speed-ups.
>> 
>> I's expect (from actual experience on PDP/VAX ;-) at least another
>> order of magnitude reduction in data size _BEFORE_ compression.
>> 
>> So the universe of parentdir Provides: likely fits in
>>      7.1 / 10 ~= 1%
>> of the size of the existing files.xml store.
>> 
>> None of the above is rocket science, just exploiting redundancy
>> redundancy in /bing/bang/boom paths with specific fore knowledge,
>> rather than relying on Huffman encoded dictionaries with adaptive
>> statistics.
> 
> I was recently looking at making a "manifest" for FreeBSD,
> which consists of a simple files listing for *each package*.
> 
> ftp://ftp.freebsd.org/pub/FreeBSD/ports/amd64/packages-8.1-release/All/*.tbz
> 
> I was looking at the Slackware MANIFEST as a reference, which
> is just a tarball listing of each file prefixed with a header:
> 
> ftp://ftp.slackware.com/pub/slackware/slackware64-13.1/slackware64/MANIFEST.bz2
> 

Just splitting out a BaseURL, starts to save bytes in the manifest
with no loss in generality other than each item in manifest is scoped within
a BasURL somehow.

> I could only come up with encoding each directory separately,
> not anything more clever like the techniques mentioned above.
> 
> # portinfo
> dirname1/basename1
> dirname1/basename2
> dirname2/basename3
> 
> =>
> 
> portinfo|dirname1/basename1:basename2|dirname2/basename3|
> 
> Should be able to do a better "portsearch" armed with that.
> There's one available in C, but I wanted a Ruby version...
> (just so that it fits into the existing pkgtools framework)
> Existing one at http://people.freebsd.org/~vd/portsearch/
> 

There may be other stores if/when searches/patterns become important.

If all that's needed is shorter and simple 1-pass recreation, the Woods
(and similar prefix stores) are good. Memoizing like in *.solv is
good for assembling hash lookups.

>> The point is that there's nothing very useful or sane about files.xml*
>> as currently (and naively) encoded in XML. Just a teensy amount
>> of thought saves far more bandwidth than any amount of spewage/formatting
>> discussion.
> 
> 
> The files.xml seems to be mostly ASCII text (in contents),
> so seems to be on par with the Slackware MANIFEST offering...
> And the filelists.xml is too, only it is using more markup...
> Since it does a <file>path</file> tag wrap per line, as well.
> 

Yes. At the time repo-md was designed, the goal was to split
dependencies from files and thereby reduce the download size
and server load. So noone cared what was in files.xml

That goal has not been achieved, and there are other savings,
like HTTP 1.1 persistence possible these days that were not
widely deployed when repo-md was devised.

The other goal for repo-md was standard XML markup. Which
would make perfect sense if anything other than depsolvers
used the markup. But yum has moved to sqlite, zypper to *.solv,
smart has its pickle, and apt-rpm its cache, none of which
really wish XML as the primary/important markup.

The point being is that files.* could be coded up almost
anyway one wished, and regenerated as an XML stream if
there was _REALLY_ an interest, and nothing much would
really break. Yes an additional conversion might have to
be introduced, just its not a hard/slow conversion because
it can be done in a single pass without libraries and parsers
and ... the Woods store can be coded up pretty quickly in
any language, not just C.

Perhaps there's some UTF-8 encoding issues these days *shrug*
its still just uglix file paths mostly.

>> Disclaimers:
>>      All of the above was delivered to the people who designed
>>      the files.xml at the time.
>> 
>>      And the Woods algorithm is what is/was responsible for
>>      the change to represent file paths as {DIRNAMES,BASENAMES,DIRINDEXES}
>>      just ewt dinna quite understand what I was suggesting, and so
>>      did Something Else Instead in rpm-3.0.4. Oh well ... the boss is always 
>> right.
> 
> 
> I was getting the savings in size/bandwidth, but not the
> improvements in speed. Should do a better implementation.
> It all started with a naive request of "list all packages
> that have a *.desktop file", which turned out "hard to do". 
> 
> --anders
> 
> 
> PS.
> Original use case was extracting all .desktop, with icons:
> https://github.com/hughsie/app-install (or as XML perhaps)
> 

Good luck!

hth

73 de Jeff
> ______________________________________________________________________
> RPM Package Manager                                    http://rpm5.org
> Developer Communication List                        rpm-devel@rpm5.org

______________________________________________________________________
RPM Package Manager                                    http://rpm5.org
Developer Communication List                        rpm-devel@rpm5.org

Reply via email to