Jeff Johnson wrote: > There are some very simple data reductions on hierarchical > paths too. One of the best known is > Run a dictionary: assign an integer weighted by # of > occurences to favor small integers for frequently > encountered tokens between /.../ (all of "usr" and "bin" > and "lib" and "var" will have small integers assigned). > > This was used by *BSD find/locate to reduce the size needed > for all possible paths to something much smaller that could be > read/regenerated easily (KEY:VAL dictionary up front, integer > substituted strings with lookup for single pass regenerate). > > Somebody Woods devised the *BSD path locate encoding in 1984 iirc.
Google said http://techreports.lib.berkeley.edu/accessPages/CSD-83-148.html Finding Files Fast Authors: Woods, James A. Technical Report Identifier: CSD-83-148 January 15, 1983 Interesting reading, should have done this homework earlier... > see whatever linux is doing (likely not too much/better) with > find/locate for data reduction and speed-ups. > > I's expect (from actual experience on PDP/VAX ;-) at least another > order of magnitude reduction in data size _BEFORE_ compression. > > So the universe of parentdir Provides: likely fits in > 7.1 / 10 ~= 1% > of the size of the existing files.xml store. > > None of the above is rocket science, just exploiting redundancy > redundancy in /bing/bang/boom paths with specific fore knowledge, > rather than relying on Huffman encoded dictionaries with adaptive > statistics. I was recently looking at making a "manifest" for FreeBSD, which consists of a simple files listing for *each package*. ftp://ftp.freebsd.org/pub/FreeBSD/ports/amd64/packages-8.1-release/All/*.tbz I was looking at the Slackware MANIFEST as a reference, which is just a tarball listing of each file prefixed with a header: ftp://ftp.slackware.com/pub/slackware/slackware64-13.1/slackware64/MANIFEST.bz2 I could only come up with encoding each directory separately, not anything more clever like the techniques mentioned above. # portinfo dirname1/basename1 dirname1/basename2 dirname2/basename3 => portinfo|dirname1/basename1:basename2|dirname2/basename3| Should be able to do a better "portsearch" armed with that. There's one available in C, but I wanted a Ruby version... (just so that it fits into the existing pkgtools framework) Existing one at http://people.freebsd.org/~vd/portsearch/ > The point is that there's nothing very useful or sane about files.xml* > as currently (and naively) encoded in XML. Just a teensy amount > of thought saves far more bandwidth than any amount of spewage/formatting > discussion. The files.xml seems to be mostly ASCII text (in contents), so seems to be on par with the Slackware MANIFEST offering... And the filelists.xml is too, only it is using more markup... Since it does a <file>path</file> tag wrap per line, as well. > Disclaimers: > All of the above was delivered to the people who designed > the files.xml at the time. > > And the Woods algorithm is what is/was responsible for > the change to represent file paths as {DIRNAMES,BASENAMES,DIRINDEXES} > just ewt dinna quite understand what I was suggesting, and so > did Something Else Instead in rpm-3.0.4. Oh well ... the boss is always > right. I was getting the savings in size/bandwidth, but not the improvements in speed. Should do a better implementation. It all started with a naive request of "list all packages that have a *.desktop file", which turned out "hard to do". --anders PS. Original use case was extracting all .desktop, with icons: https://github.com/hughsie/app-install (or as XML perhaps) ______________________________________________________________________ RPM Package Manager http://rpm5.org Developer Communication List [email protected]
