Hello Craig,
good idea!
let's consider what would happen if dlocate would work with compressed
text files instead:
On Sat, 30 May 2009, Craig Sanders wrote:
i've just done some simple tests and found the following:
1. on one of my systems (a laptop with 128MB RAM), dlocatedb takes up
696KB of disk space. a plain text dump of it takes up 3.0MB
text dump generated with 'dlocate / > dlocate.txt'
i then made sure that both dlocatedb and dlocate.txt were not
in the disk cache by catting approx 150MB of files to /dev/null.
# ls -lh dlocate.txt dlocatedb
-rw-r--r-- 1 root root 3.0M 2009-05-30 11:14 dlocate.txt
-rw-r--r-- 1 root root 696K 2009-05-30 06:29 dlocatedb
# wc -l dlocate.txt
62090 dlocate.txt
That means the dlocate DB in raw text form is roughly 5 times larger.
Compare that with a gzipped text DB:
$ ls -l /tmp/dump.txt /tmp/dump.txt.gz
-rw-r--r-- 1 tpo tpo 13270880 2009-06-02 20:11 /tmp/dump.txt
-rw-r--r-- 1 tpo tpo 1183828 2009-06-02 20:11 /tmp/dump.txt.gz
Thus the gzipped text DB is about 10* smaller than the raw text DB. This
means that the gzipped text DB would be about half the size of the
original dlocatedb.
2. searching the dlocatedb with locate for a single file takes 1.091
seconds. grepping for the same file in the text dump takes 0.584
seconds.
the filename "usr/share/doc/apache2.2-bin/changelog.gz" was chosen because
it is the very last line in dlocate.txt
# time dlocate usr/share/doc/apache2.2-bin/changelog.gz
apache2.2-bin: /usr/share/doc/apache2.2-bin/changelog.gz
real 0m1.091s
user 0m0.484s
sys 0m0.044s
# time grep usr/share/doc/apache2.2-bin/changelog.gz dlocate.txt
apache2.2-bin: /usr/share/doc/apache2.2-bin/changelog.gz
real 0m0.584s
user 0m0.008s
sys 0m0.020s
I did a "find /usr -exec cat \{\} \;" to empty the buffer cache and then
grepped for the last entry in the file as you did (grepping for the first
one was about 25% faster - probably within the margin of error):
$ time grep /usr/share/doc/kde/HTML/en/konq-plugins/khtmlsettings/common
/tmp/dump.txt
konqueror-plugin-khtmlsettings:
/usr/share/doc/kde/HTML/en/konq-plugins/khtmlsettings/common
real 0m0.025s
user 0m0.020s
sys 0m0.004s
$ time zgrep /usr/share/doc/kde/HTML/en/konq-plugins/khtmlsettings/common
/tmp/dump.txt.gz
konqueror-plugin-khtmlsettings:
/usr/share/doc/kde/HTML/en/konq-plugins/khtmlsettings/common
real 0m0.178s
user 0m0.132s
sys 0m0.020s
Thus grepping through the gzipped file is about 10 times slower, however
it's still very fast, at least on my system.
3. repeating the test immediately with both files cached in RAM gives
0.512 seconds (dlocate) and 0.034s (grep)
# time dlocate usr/share/doc/apache2.2-bin/changelog.gz
apache2.2-bin: /usr/share/doc/apache2.2-bin/changelog.gz
real 0m0.512s
user 0m0.476s
sys 0m0.032s
# time grep usr/share/doc/apache2.2-bin/changelog.gz dlocate.txt
apache2.2-bin: /usr/share/doc/apache2.2-bin/changelog.gz
real 0m0.034s
user 0m0.012s
sys 0m0.024s
on the first run, grep is twice as fast as dlocate. on subsequent runs,
it is about 15 times faster.
This doesn't make a difference here for raw vs gzipped text.
there appears to be no advantage whatsoever to using frcode any more (in
fact, locate is much slower than plain grep), and disk space is so cheap
that the difference between 700KB and 3MB is irrelevant.
accordingly the solution to this on-going dlocate/locate/mlocate
confusion will be the release of a new version of dlocate that doesn't
use or depend on frcode or locate, but instead just uses a plain text
file and grep.
i have a few other things on my TODO list for dlocate. I'll get them
done and release a new version. hopefully this weekend if real life
doesn't intrude.
i think i'll also add a few more options to dlocate to take advantage of
GNU grep's ability to use different Matchers - from grep(1):
Matcher Selection
-E, --extended-regexp
Interpret PATTERN as an extended regular expression (ERE,
see below). (-E is specified by POSIX.)
-F, --fixed-strings
Interpret PATTERN as a list of fixed strings, separated by
newlines, any of which is to be matched. (-F is specified
by POSIX.)
-G, --basic-regexp
Interpret PATTERN as a basic regular expression (BRE, see
below). This is the default.
-P, --perl-regexp
Interpret PATTERN as a Perl regular expression. This is
highly experimental and grep -P may warn of unimplemented
features.
and i'll support -w too:
-w, --word-regexp
Select only those lines containing matches that form whole
words. The test is that the matching substring must
either be at the beginning of the line, or preceded by a
non-word constituent character. Similarly, it must be
either at the end of the line or followed by a non-word
constituent character. Word-constituent characters are
letters, digits, and the underscore.
this will change the way that dlocate works (in that it does a regexp search
rather than a plain text search) but, IMO, that's far more useful. GNU locate
has an option to do a regexp search but the timing comparison gets even more
in favour of grep:
# time locate.findutils -d /var/lib/dlocate/dlocatedb -r
usr/share/doc/apache2.2-bin/changelog.gz
apache2.2-bin: /usr/share/doc/apache2.2-bin/changelog.gz
real 0m1.796s
user 0m1.640s
sys 0m0.012s
1.796 seconds for the first run after flushing disk cache, compared to
0.512 seconds for grep. grep is over 3 times faster.
on subsequent runs, the regexp locate still takes over 1.6 seconds,
while grep takes 0.034 seconds. over 47 times faster. obviously, and not
at all surprisingly, grepping an frcode database is not a very efficient
operation.
# time locate.findutils -d /var/lib/dlocate/dlocatedb -r
usr/share/doc/apache2.2-bin/changelog.gz
apache2.2-bin: /usr/share/doc/apache2.2-bin/changelog.gz
real 0m1.640s
user 0m1.628s
sys 0m0.008s
t...@tpo-laptop:~$ time dlocate / > /tmp/dump.txt
real 0m0.297s
user 0m0.204s
sys 0m0.088s
t...@tpo-laptop:~$ time dlocate / | gzip > /tmp/dump.txt.gz
real 0m0.540s
user 0m0.504s
sys 0m0.024s
Building the text DB is twice as slow for the gzipped version.
What if we optimize for speed instead?
$ time dlocate / | gzip --fast > /tmp/dump.txt.fast.gz
real 0m0.348s
user 0m0.488s
sys 0m0.020s
Now building the fast-gzipped text DB is about as fast as the raw text
one. However zgrepping it is slightly slower:
$ time zgrep /usr/share/doc/kde/HTML/en/konq-plugins/khtmlsettings/common
/tmp/dump.txt.gz
konqueror-plugin-khtmlsettings:
/usr/share/doc/kde/HTML/en/konq-plugins/khtmlsettings/common
real 0m0.143s
user 0m0.128s
sys 0m0.020s
$ time zgrep /usr/share/doc/kde/HTML/en/konq-plugins/khtmlsettings/common
/tmp/dump.txt.fast.gz
konqueror-plugin-khtmlsettings:
/usr/share/doc/kde/HTML/en/konq-plugins/khtmlsettings/common
real 0m0.167s
user 0m0.140s
sys 0m0.012s
If our emphasis is on search time, then maybe using a compression
algorithm that omptimizes on decompression would be ideal. I tried lzma,
which in default mode is slow with compression:
$ time dlocate / | lzma > /tmp/dump.txt.lzma
real 0m9.683s
user 0m9.509s
sys 0m0.088s
And slower than gzip when using the fast variant:
$ time dlocate / | lzma --fast > /tmp/dump.txt.fast.lzma
real 0m0.789s
user 0m0.740s
sys 0m0.040s
However grepping the lzma file is surprisingly slower than zgrep:
$ time ( unlzma -c /tmp/dump.txt.lzma | grep
/usr/share/doc/kde/HTML/en/konq-plugins/khtmlsettings/common )
konqueror-plugin-khtmlsettings:
/usr/share/doc/kde/HTML/en/konq-plugins/khtmlsettings/common
real 0m0.243s
user 0m0.204s
sys 0m0.044s
Maybe however it's not fair to compaire "zgrep" and "unlzma | grep" and
anyway, I wonder what it is that I'm actually measuring here...
I did not test UCL which claims to be one of the fastest decompressing
algorithms...
Thanks and greets Craig!
*t
--
To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org