Bug#494673: solution: the next version of dlocate won't depend on locate at all

Tomas Pospisek Tue, 02 Jun 2009 12:04:51 -0700

Hello Craig,

good idea!

let's consider what would happen if dlocate worked with compressed text files instead:


On Sat, 30 May 2009, Craig Sanders wrote:

i've just done some simple tests and found the following:

1. on one of my systems (a laptop with 128MB RAM), dlocatedb takes up
  696KB of disk space. a plain text dump of it takes up 3.0MB

  text dump generated with 'dlocate / > dlocate.txt'

  i then made sure that both dlocatedb and dlocate.txt were not
  in the disk cache by catting approx 150MB of files to /dev/null.

# ls -lh dlocate.txt dlocatedb
-rw-r--r-- 1 root root 3.0M 2009-05-30 11:14 dlocate.txt
-rw-r--r-- 1 root root 696K 2009-05-30 06:29 dlocatedb

# wc -l dlocate.txt
62090 dlocate.txt

That means the dlocate DB in raw text form is roughly 4.4 times larger (3.0M vs 696K). Compare that with a gzipped text DB:


$ ls -l /tmp/dump.txt /tmp/dump.txt.gz
-rw-r--r-- 1 tpo tpo 13270880 2009-06-02 20:11 /tmp/dump.txt
-rw-r--r-- 1 tpo tpo  1183828 2009-06-02 20:11 /tmp/dump.txt.gz

Thus the gzipped text DB is about 11 times smaller than the raw text DB (13270880 / 1183828 is roughly 11.2). Assuming the same ratio holds for your dump, this means that the gzipped text DB would be a bit less than half the size of the original dlocatedb.
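
The size arithmetic can be checked with a few lines of shell. This is only a small sketch: the helper name "ratio" is made up for illustration, and the 3.0M figure is taken from the rounded ls -lh output, so the second result is approximate.

```shell
#!/bin/sh
# ratio A B: print A/B with one decimal place.
# (Hypothetical helper, only for checking the numbers in this mail.)
ratio() {
    LC_ALL=C awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b }'
}

# Raw text dump vs. gzipped dump, byte counts from the ls -l listing.
ratio 13270880 1183828    # prints 11.2

# Text dump vs. dlocatedb, in KB, from the rounded ls -lh listing
# (3.0M taken as 3072K, so this is an approximation).
ratio 3072 696            # prints 4.4
```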

2. searching the dlocatedb with locate for a single file takes 1.091
  seconds. grepping for the same file in the text dump takes 0.584
  seconds.

the filename "usr/share/doc/apache2.2-bin/changelog.gz" was chosen because
it is the very last line in dlocate.txt

# time dlocate usr/share/doc/apache2.2-bin/changelog.gz
apache2.2-bin: /usr/share/doc/apache2.2-bin/changelog.gz

real    0m1.091s
user    0m0.484s
sys     0m0.044s

# time grep usr/share/doc/apache2.2-bin/changelog.gz dlocate.txt
apache2.2-bin: /usr/share/doc/apache2.2-bin/changelog.gz

real    0m0.584s
user    0m0.008s
sys     0m0.020s

I did a "find /usr -exec cat \{\} \;" to empty the buffer cache and then grepped for the last entry in the file as you did (grepping for the first one was about 25% faster, probably within the margin of error):


$ time grep /usr/share/doc/kde/HTML/en/konq-plugins/khtmlsettings/common /tmp/dump.txt
konqueror-plugin-khtmlsettings: /usr/share/doc/kde/HTML/en/konq-plugins/khtmlsettings/common

real    0m0.025s
user    0m0.020s
sys     0m0.004s
$ time zgrep /usr/share/doc/kde/HTML/en/konq-plugins/khtmlsettings/common /tmp/dump.txt.gz
konqueror-plugin-khtmlsettings: /usr/share/doc/kde/HTML/en/konq-plugins/khtmlsettings/common

real    0m0.178s
user    0m0.132s
sys     0m0.020s

Thus grepping through the gzipped file is about 7 times slower (0.178s vs 0.025s); however, it's still very fast, at least on my system.

3. repeating the test immediately with both files cached in RAM gives
  0.512 seconds (dlocate) and 0.034s (grep)

# time dlocate usr/share/doc/apache2.2-bin/changelog.gz
apache2.2-bin: /usr/share/doc/apache2.2-bin/changelog.gz

real    0m0.512s
user    0m0.476s
sys     0m0.032s

# time grep usr/share/doc/apache2.2-bin/changelog.gz dlocate.txt
apache2.2-bin: /usr/share/doc/apache2.2-bin/changelog.gz

real    0m0.034s
user    0m0.012s
sys     0m0.024s

on the first run, grep is twice as fast as dlocate. on subsequent runs,
it is about 15 times faster.


On my system, whether the files are cached doesn't make a difference for raw vs gzipped text.

there appears to be no advantage whatsoever to using frcode any more (in
fact, locate is much slower than plain grep), and disk space is so cheap
that the difference between 700KB and 3MB is irrelevant.

accordingly the solution to this on-going dlocate/locate/mlocate
confusion will be the release of a new version of dlocate that doesn't
use or depend on frcode or locate, but instead just uses a plain text
file and grep.
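
A rough sketch of what "a plain text file and grep" could look like follows. This is not the actual dlocate code: the function name "lookup" and the sample database are invented for illustration, and the "package: /path" line format is taken from the output shown above.

```shell
#!/bin/sh
# Hypothetical sketch only, NOT the real dlocate implementation.
# Assumes one "package: /path" entry per line, as in the text dump.

# lookup DBFILE STRING: print every entry containing STRING literally.
lookup() {
    grep -F -- "$2" "$1"
}

# Tiny stand-in database so the sketch runs without dlocate installed.
db=$(mktemp)
cat > "$db" <<'EOF'
apache2.2-bin: /usr/share/doc/apache2.2-bin/changelog.gz
konqueror-plugin-khtmlsettings: /usr/share/doc/kde/HTML/en/konq-plugins/khtmlsettings/common
EOF

lookup "$db" usr/share/doc/apache2.2-bin/changelog.gz
# prints: apache2.2-bin: /usr/share/doc/apache2.2-bin/changelog.gz

rm -f "$db"
```

The "--" keeps a search string that starts with "-" from being read as a grep option.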

i have a few other things on my TODO list for dlocate.  I'll get them
done and release a new version. hopefully this weekend if real life
doesn't intrude.

i think i'll also add a few more options to dlocate to take advantage of
GNU grep's ability to use different Matchers - from grep(1):

  Matcher Selection
      -E, --extended-regexp
             Interpret PATTERN as an extended regular expression (ERE,
             see below).  (-E is specified by POSIX.)

      -F, --fixed-strings
             Interpret PATTERN as a list of fixed strings, separated by
             newlines, any of which is to be matched.  (-F is specified
             by POSIX.)

      -G, --basic-regexp
             Interpret PATTERN as a basic regular expression (BRE, see
             below).  This is the default.

      -P, --perl-regexp
             Interpret PATTERN as a Perl regular expression.  This is
             highly experimental and grep -P may warn of unimplemented
             features.

and i'll support -w too:

      -w, --word-regexp
             Select only those lines containing matches that form whole
             words.  The test is that the matching substring must
             either be at the beginning of the line, or preceded by a
             non-word constituent character.  Similarly, it must be
             either at the end of the line or followed by a non-word
             constituent character.  Word-constituent characters are
             letters, digits, and the underscore.
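
To illustrate how the matchers differ on dlocate-style lines, here is a small sketch. The helper "count_matches" and the second sample line (apache2x2-bin) are made up for the demonstration; -P is left out because not every grep build supports it.

```shell
#!/bin/sh
# count_matches GREP_ARGS...: print the number of matching lines.
# (Hypothetical helper; "|| true" hides grep's exit status 1 on no match.)
count_matches() {
    grep -c "$@" || true
}

db=$(mktemp)
cat > "$db" <<'EOF'
apache2.2-bin: /usr/share/doc/apache2.2-bin/changelog.gz
apache2x2-bin: /usr/share/doc/apache2x2-bin/changelog.gz
EOF

# -G (BRE, the default): "." matches any character, so both lines match.
count_matches -G 'apache2.2-bin' "$db"          # prints 2
# -F: the pattern is a literal string, so only the first line matches.
count_matches -F 'apache2.2-bin' "$db"          # prints 1
# -E: alternation and grouping without backslashes.
count_matches -E 'apache2(x|\.)2-bin' "$db"     # prints 2
# -w: "change" is followed by "l" in "changelog", so it is not a whole word.
count_matches -w 'change' "$db"                 # prints 0
# -w: "changelog" sits between "/" and ".", both non-word characters.
count_matches -w 'changelog' "$db"              # prints 2

rm -f "$db"
```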


this will change the way that dlocate works (in that it does a regexp search
rather than a plain text search) but, IMO, that's far more useful.  GNU locate
has an option to do a regexp search but the timing comparison gets even more
in favour of grep:

# time locate.findutils -d /var/lib/dlocate/dlocatedb -r usr/share/doc/apache2.2-bin/changelog.gz
apache2.2-bin: /usr/share/doc/apache2.2-bin/changelog.gz

real    0m1.796s
user    0m1.640s
sys     0m0.012s

1.796 seconds for the first run after flushing disk cache, compared to
0.584 seconds for grep. grep is over 3 times faster.

on subsequent runs, the regexp locate still takes over 1.6 seconds,
while grep takes 0.034 seconds. over 47 times faster. obviously, and not
at all surprisingly, grepping an frcode database is not a very efficient
operation.

# time locate.findutils -d /var/lib/dlocate/dlocatedb -r usr/share/doc/apache2.2-bin/changelog.gz
apache2.2-bin: /usr/share/doc/apache2.2-bin/changelog.gz

real    0m1.640s
user    0m1.628s
sys     0m0.008s


t...@tpo-laptop:~$ time dlocate / > /tmp/dump.txt

real    0m0.297s
user    0m0.204s
sys     0m0.088s
t...@tpo-laptop:~$ time dlocate / | gzip > /tmp/dump.txt.gz

real    0m0.540s
user    0m0.504s
sys     0m0.024s

Building the text DB is twice as slow for the gzipped version.

What if we optimize for speed instead?

$ time dlocate / | gzip --fast > /tmp/dump.txt.fast.gz

real    0m0.348s
user    0m0.488s
sys     0m0.020s

Now building the fast-gzipped text DB is about as fast as the raw text one. However, zgrepping it is slightly slower:


$ time zgrep /usr/share/doc/kde/HTML/en/konq-plugins/khtmlsettings/common /tmp/dump.txt.gz
konqueror-plugin-khtmlsettings: /usr/share/doc/kde/HTML/en/konq-plugins/khtmlsettings/common

real    0m0.143s
user    0m0.128s
sys     0m0.020s

$ time zgrep /usr/share/doc/kde/HTML/en/konq-plugins/khtmlsettings/common /tmp/dump.txt.fast.gz
konqueror-plugin-khtmlsettings: /usr/share/doc/kde/HTML/en/konq-plugins/khtmlsettings/common

real    0m0.167s
user    0m0.140s
sys     0m0.012s

If our emphasis is on search time, then maybe using a compression algorithm that optimizes for decompression speed would be ideal. I tried lzma, which in default mode is slow at compression:


$ time dlocate / | lzma > /tmp/dump.txt.lzma

real    0m9.683s
user    0m9.509s
sys     0m0.088s

And slower than gzip when using the fast variant:

$ time dlocate / | lzma --fast > /tmp/dump.txt.fast.lzma

real    0m0.789s
user    0m0.740s
sys     0m0.040s

However grepping the lzma file is surprisingly slower than zgrep:

$ time ( unlzma -c /tmp/dump.txt.lzma | grep /usr/share/doc/kde/HTML/en/konq-plugins/khtmlsettings/common )
konqueror-plugin-khtmlsettings: /usr/share/doc/kde/HTML/en/konq-plugins/khtmlsettings/common

real    0m0.243s
user    0m0.204s
sys     0m0.044s

Maybe, however, it's not fair to compare "zgrep" and "unlzma | grep", and anyway, I wonder what it is that I'm actually measuring here...
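
One way to make the comparison more like-for-like would be to use the same pipeline shape for every compressor. As far as I know, GNU zgrep is itself a shell script wrapping gzip and grep, so it carries some extra startup cost. The sketch below is only an illustration: "search_compressed" is a made-up helper, the dump is synthetic, and only gzip is exercised because lzma may not be installed everywhere.

```shell
#!/bin/sh
# search_compressed TOOL FILE STRING
# Hypothetical helper: decompress FILE to stdout with TOOL and search
# for the literal STRING. Assumes TOOL accepts -d and -c, as gzip does.
search_compressed() {
    "$1" -dc "$2" | grep -F -- "$3"
}

# Build a small synthetic dump so the sketch runs anywhere.
tmp=$(mktemp -d)
i=0
while [ "$i" -lt 1000 ]; do
    echo "pkg$i: /usr/share/doc/pkg$i/changelog.gz"
    i=$((i + 1))
done > "$tmp/dump.txt"
gzip -c "$tmp/dump.txt" > "$tmp/dump.txt.gz"

# Same shape as "unlzma -c ... | grep ...", but with gzip.
search_compressed gzip "$tmp/dump.txt.gz" /usr/share/doc/pkg999/changelog.gz
# prints: pkg999: /usr/share/doc/pkg999/changelog.gz

rm -rf "$tmp"
```

Timing each call to the helper the same way, once per compressor, would then measure mostly decompression plus grep rather than wrapper overhead.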

I did not test UCL, which claims to be one of the fastest decompressing algorithms...


Thanks and greets Craig!
*t



