On Red Hat 9:

  $ grep --version
  grep (GNU grep) 2.5.1

  $ LC_ALL=en_GB.UTF-8 time grep XYZ test.txt
  Command exited with non-zero status 1
  6.83user 0.07system 0:06.93elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
  0inputs+0outputs (157major+34minor)pagefaults 0swaps

  $ LC_ALL=POSIX time grep XYZ test.txt
  Command exited with non-zero status 1
  0.07user 0.09system 0:00.16elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
  0inputs+0outputs (125major+24minor)pagefaults 0swaps

where test.txt is just

  http://www.unicode.org/Public/UNIDATA/UnicodeData.txt

repeated 10 times.

It seems grep performs about 100x worse (in user CPU time) in a UTF-8
locale than in an ASCII locale, even where the search string contains no
regex metacharacters. And fgrep is no better.

There is no technical reason why grep should have to be any slower in a
UTF-8 locale than in a single-byte locale if the string does not contain
any regex metacharacters at all. In that case, UTF-8 can be processed
just like ASCII.

In UTF-8 mode, grep is also much slower than the equivalent Perl:

  $ LC_ALL=en_GB.UTF-8 time perl -ne '/XYZ/ && print' test.txt
  1.49user 0.05system 0:01.55elapsed 98%CPU (0avgtext+0avgdata 0maxresident)k
  0inputs+0outputs (339major+45minor)pagefaults 0swaps

  $ LC_ALL=POSIX time perl -ne '/XYZ/ && print' test.txt
  1.17user 0.09system 0:01.28elapsed 98%CPU (0avgtext+0avgdata 0maxresident)k
  0inputs+0outputs (322major+45minor)pagefaults 0swaps

Any suggestions? It would be nice not to be penalized like this by grep
for using a UTF-8 locale by default.

Markus

--
Markus Kuhn, Computer Laboratory, University of Cambridge
http://www.cl.cam.ac.uk/~mgk25/ || CB3 0FD, Great Britain

--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/
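
P.S. To illustrate the claim that a literal search string can be matched in UTF-8 just like in ASCII, here is a minimal Python sketch. It is not how grep is implemented; it only shows the idea. It relies on the fact that in well-formed UTF-8 the encoding of one character never appears inside, or straddling, the encodings of other characters, so a plain byte-wise substring search gives the same hits as a character-wise one. The function names and the sample lines are made up for this example, and the input is assumed to be valid UTF-8. Case-insensitive matching and Unicode normalization are outside its scope.

```python
def grep_fixed_bytes(pattern, lines):
    """Yield the raw lines containing pattern, comparing bytes only.

    No line is ever decoded; the pattern is encoded once up front.
    """
    needle = pattern.encode("utf-8")
    for raw in lines:
        if needle in raw:
            yield raw


def grep_fixed_decoded(pattern, lines):
    """Reference version: decode every line to characters, then search."""
    for raw in lines:
        if pattern in raw.decode("utf-8"):
            yield raw


# Hypothetical sample data, stored as UTF-8 encoded byte strings.
sample = [
    "0041;LATIN CAPITAL LETTER A\n".encode("utf-8"),
    "naïve XYZ café\n".encode("utf-8"),
    "日本語のテキスト\n".encode("utf-8"),
    "ΧΥΖ (Greek capitals chi, upsilon, zeta)\n".encode("utf-8"),
]

if __name__ == "__main__":
    for pat in ("XYZ", "é", "本"):
        fast = list(grep_fixed_bytes(pat, sample))
        slow = list(grep_fixed_decoded(pat, sample))
        print(pat, fast == slow, len(fast))
```

The byte-wise version does no per-character decoding at all, which is why a fixed-string search should not need to cost more in a UTF-8 locale than in a single-byte one.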
