grep is horribly slow in UTF-8 locales

Markus Kuhn Fri, 07 Nov 2003 08:39:40 -0800

On Red Hat 9:

$ grep --version
grep (GNU grep) 2.5.1
$ LC_ALL=en_GB.UTF-8 time grep XYZ test.txt
Command exited with non-zero status 1
6.83user 0.07system 0:06.93elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (157major+34minor)pagefaults 0swaps
$ LC_ALL=POSIX time grep XYZ test.txt
Command exited with non-zero status 1
0.07user 0.09system 0:00.16elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (125major+24minor)pagefaults 0swaps


where test.txt is just http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
repeated 10 times.
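
For anyone who wants to reproduce this, the test file can be built along
these lines. This is only a sketch: it assumes UnicodeData.txt has already
been downloaded into the current directory, and repeat_file is a
hypothetical helper name, not part of any tool mentioned here.

```python
from pathlib import Path


def repeat_file(src, dst, times=10):
    """Write `times` back-to-back copies of the file `src` to `dst`."""
    data = Path(src).read_bytes()  # raw bytes, so no locale or decoding is involved
    with open(dst, "wb") as out:
        for _ in range(times):
            out.write(data)


# Usage (assumes UnicodeData.txt is in the current directory):
#     repeat_file("UnicodeData.txt", "test.txt")
```

Copying raw bytes keeps the result independent of the locale the script
runs in.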

It seems grep performs about 100x worse in a UTF-8 locale than in an
ASCII locale (6.83 s versus 0.07 s of user CPU time above), even where
the search string contains no regex metacharacters.

And fgrep is no better.

There is technically no reason why grep should have to be any slower in
a UTF-8 locale than in a single-byte locale if the search string does not
contain any regex metacharacters at all. In that case, UTF-8 can be
processed just like ASCII.

In UTF-8 mode, grep is also much slower than the equivalent Perl:

$ LC_ALL=en_GB.UTF-8 time perl -ne '/XYZ/ && print' test.txt
1.49user 0.05system 0:01.55elapsed 98%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (339major+45minor)pagefaults 0swaps
$ LC_ALL=POSIX time perl -ne '/XYZ/ && print' test.txt
1.17user 0.09system 0:01.28elapsed 98%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (322major+45minor)pagefaults 0swaps

Any suggestions? It would be nice not to be penalized like this by grep
for using a UTF-8 locale by default.

Markus

-- 
Markus Kuhn, Computer Laboratory, University of Cambridge
http://www.cl.cam.ac.uk/~mgk25/ || CB3 0FD, Great Britain

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
