On Red Hat 9:
$ grep --version
grep (GNU grep) 2.5.1
$ LC_ALL=en_GB.UTF-8 time grep XYZ test.txt
Command exited with non-zero status 1
6.83user 0.07system 0:06.93elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (157major+34minor)pagefaults 0swaps
$ LC_ALL=POSIX time grep XYZ test.txt
Command exited with non-zero status 1
0.07user 0.09system 0:00.16elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (125major+24minor)pagefaults 0swaps
where test.txt is just http://www.unicode.org/Public/UNIDATA/UnicodeData.txt repeated 10 times.
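For anyone wanting to reproduce the timings, the test file can be rebuilt roughly as follows. This is only a sketch: the function name build_test_file is made up for illustration, and it assumes UnicodeData.txt has already been downloaded from the URL above into the current directory.

```shell
# Hypothetical helper (not from the original post): write COUNT copies
# of SRC into DST, one after another. COUNT defaults to 10.
build_test_file() {
    src=$1
    dst=$2
    count=${3:-10}
    : > "$dst"                    # create or truncate the output file
    i=0
    while [ "$i" -lt "$count" ]; do
        cat "$src" >> "$dst"      # append one more copy of the source
        i=$((i + 1))
    done
}
```

With the data file in place, `build_test_file UnicodeData.txt test.txt` should produce an input comparable to the one timed here, although the exact size depends on the Unicode version of the file.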
Wow, I dunno what's going on here. Here are the results on my system (also Red Hat 9):
$ grep --version
grep (GNU grep) 2.5.1
$ LC_ALL=en_GB.UTF-8 time grep XYZ test.txt
Command exited with non-zero status 1
1.14user 0.04system 0:01.19elapsed 98%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (156major+32minor)pagefaults 0swaps
$ LC_ALL=POSIX time grep XYZ test.txt
Command exited with non-zero status 1
0.01user 0.03system 0:00.03elapsed 102%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (125major+25minor)pagefaults 0swaps
It seems grep performs about 100x worse in a UTF-8 locale than in an ASCII locale, even when the search string contains no regex metacharacters.
grep is slower on my system, but it doesn't appear to be as bad as on your system.
In UTF-8 mode, grep is also much slower than the equivalent Perl:
$ LC_ALL=en_GB.UTF-8 time perl -ne '/XYZ/ && print' test.txt
1.49user 0.05system 0:01.55elapsed 98%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (339major+45minor)pagefaults 0swaps
$ LC_ALL=POSIX time perl -ne '/XYZ/ && print' test.txt
1.17user 0.09system 0:01.28elapsed 98%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (322major+45minor)pagefaults 0swaps
$ LC_ALL=en_GB.UTF-8 time perl -ne '/XYZ/ && print' test.txt
0.30user 0.01system 0:00.33elapsed 92%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (341major+45minor)pagefaults 0swaps
$ LC_ALL=POSIX time perl -ne '/XYZ/ && print' test.txt
0.19user 0.06system 0:00.24elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (325major+44minor)pagefaults 0swaps
Any suggestions? It would be nice not to be penalized like this by grep for using a UTF-8 locale by default.
Sorry buddy, I have no idea :(
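One workaround that follows from the timings above, offered as a sketch rather than a fix for grep itself: when the search string is plain ASCII, the locale can be overridden for that single grep invocation while the rest of the session stays in UTF-8. The helper name ascii_grep is made up for illustration. Because ASCII bytes never occur inside a UTF-8 multibyte sequence, a byte-wise search for a fixed ASCII string should select the same lines as a UTF-8-aware one.

```shell
# Hypothetical helper (not from the original thread): search for a fixed
# string with grep running in the C locale, regardless of the caller's
# locale. The first argument is the string, the rest are file names.
# Only intended for pure-ASCII search strings.
ascii_grep() {
    # LC_ALL=C applies to this one grep process only.
    # -F treats the pattern as a fixed string, not a regex.
    LC_ALL=C grep -F -- "$@"
}
```

Usage would be `ascii_grep XYZ test.txt`. The trade-off is that in the C locale grep works on bytes, so this is not suitable for non-ASCII patterns, case-insensitive matching of non-ASCII text, or character classes that need to understand multibyte characters.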
--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/
