Seth David Schoen wrote: > In grep 2.6.3 from Ubuntu Maverick, I noticed that patterns using > a character range in a bracket expression are extraordinarily slow. > E.g. > > schoen@sescenties:~$ time grep '[A-Z]' /usr/share/dict/words > /dev/null > > real 0m26.819s > user 0m26.134s > sys 0m0.012s > schoen@sescenties:~$ time grep -v '[A-Z]' /usr/share/dict/words > /dev/null > > real 0m27.438s > user 0m26.130s > sys 0m0.008s > > (I first noticed this problem when trying to make a word list that > excludes proper names for use in an anagram game.) > > What's really odd about this is that the same pattern written without a > range, using an explicit list of characters, is dramatically faster: > > schoen@sescenties:~$ time grep '[ABCDEFGHIJKLMNOPQRSTUVWXYZ]' > /usr/share/dict/words > /dev/null > > real 0m0.020s > user 0m0.012s > sys 0m0.004s > schoen@sescenties:~$ time grep -v '[ABCDEFGHIJKLMNOPQRSTUVWXYZ]' > /usr/share/dict/words > /dev/null > > real 0m0.033s > user 0m0.032s > sys 0m0.000s > > That is, using the explicit list of characters in the backet > expression was three orders of magnitude faster! Using the > character class [:upper:] is also fast: > > schoen@sescenties:~$ time grep '[:upper:]' /usr/share/dict/words > /dev/null > > real 0m0.026s > user 0m0.024s > sys 0m0.000s > > The disparity in the snapshot version released today, 2.7.43-ed23d, > is much less extreme, but it's still present: > > schoen@sescenties:~/grep-2.7.43-ed23d/src$ time ./grep '[A-Z]' > /usr/share/dict/words > /dev/null > > real 0m0.417s > user 0m0.372s > sys 0m0.016s > schoen@sescenties:~/grep-2.7.43-ed23d/src$ time ./grep > [ABCDEFGHIJKLMNOPQRSTUVWXYZ]' /usr/share/dict/words > /dev/null > > real 0m0.025s > user 0m0.024s > sys 0m0.000s > > Now using the character range is only one order of magnitude slower, > instead of three. > > In grep 2.6.3, the effect seems to depend linearly on the number of > characters in the range. In 2.7.43-ed23d, there seems to be the > same slowdown as a result of using any range, regardless of whether > it contains dozens of characters or just three, two, or one: > > schoen@sescenties:~/grep-2.7.43-ed23d/src$ time ./grep '[X-Z]' > /usr/share/dict/words > /dev/null > > real 0m0.428s > user 0m0.388s > sys 0m0.012s > schoen@sescenties:~/grep-2.7.43-ed23d/src$ time ./grep '[XYZ]' > /usr/share/dict/words > /dev/null > > real 0m0.032s > user 0m0.012s > sys 0m0.004s > > The problem affects egrep as well as grep. > > The time taken in each case seems to be proportional to the amount > of input searched, so the difference is _not_ just in the initial > setup. I don't know anything about the implementation here, but > it seems unnecessary for character ranges to be so inefficient, > since a preprocessor could easily turn each range in a bracket > expression into an explicit list of characters...
Thanks for the detailed report. I presume you are using a multi-byte locale like en_US.utf8. Note that A-Z and ABCDEFGHIJKLMNOPQRSTUVWXYZ are not equivalent. The A-Z range may contain many other characters, including lower case letters. If your search strings and data are all single-byte, you may prefer to use a single-byte locale. If you set LC_CTYPE=C in your environment, all of the above will run quickly. That would avoid the inherent expense of using a UTF-8 locale. When you use the C locale, grep can take advantage of the simpler locale and works more like you would expect: $ LC_CTYPE=C env time ./grep '[XYZ]' /usr/share/dict/words > /dev/null 0.02user 0.00system 0:00.02elapsed 96%CPU (0avgtext+0avgdata 3120maxresident)k 0inputs+0outputs (0major+227minor)pagefaults 0swaps $ LC_CTYPE=C env time ./grep '[X-Z]' /usr/share/dict/words > /dev/null 0.02user 0.00system 0:00.02elapsed 92%CPU (0avgtext+0avgdata 3136maxresident)k 0inputs+0outputs (0major+228minor)pagefaults 0swaps Contrast that with the 4 seconds it takes when I'm using a common multi-byte locale: $ LC_CTYPE=en_US.utf8 env time ./grep '[X-Z]' /usr/share/dict/words > /dev/null 3.33user 0.77system 0:04.12elapsed 99%CPU (0avgtext+0avgdata 4208maxresident)k 0inputs+0outputs (0major+353974minor)pagefaults 0swaps
