Hi, In grep 2.6.3 from Ubuntu Maverick, I noticed that patterns using a character range in a bracket expression are extraordinarily slow. E.g.
schoen@sescenties:~$ time grep '[A-Z]' /usr/share/dict/words > /dev/null real 0m26.819s user 0m26.134s sys 0m0.012s schoen@sescenties:~$ time grep -v '[A-Z]' /usr/share/dict/words > /dev/null real 0m27.438s user 0m26.130s sys 0m0.008s (I first noticed this problem when trying to make a word list that excludes proper names for use in an anagram game.) What's really odd about this is that the same pattern written without a range, using an explicit list of characters, is dramatically faster: schoen@sescenties:~$ time grep '[ABCDEFGHIJKLMNOPQRSTUVWXYZ]' /usr/share/dict/words > /dev/null real 0m0.020s user 0m0.012s sys 0m0.004s schoen@sescenties:~$ time grep -v '[ABCDEFGHIJKLMNOPQRSTUVWXYZ]' /usr/share/dict/words > /dev/null real 0m0.033s user 0m0.032s sys 0m0.000s That is, using the explicit list of characters in the backet expression was three orders of magnitude faster! Using the character class [:upper:] is also fast: schoen@sescenties:~$ time grep '[:upper:]' /usr/share/dict/words > /dev/null real 0m0.026s user 0m0.024s sys 0m0.000s The disparity in the snapshot version released today, 2.7.43-ed23d, is much less extreme, but it's still present: schoen@sescenties:~/grep-2.7.43-ed23d/src$ time ./grep '[A-Z]' /usr/share/dict/words > /dev/null real 0m0.417s user 0m0.372s sys 0m0.016s schoen@sescenties:~/grep-2.7.43-ed23d/src$ time ./grep '[ABCDEFGHIJKLMNOPQRSTUVWXYZ]' /usr/share/dict/words > /dev/null real 0m0.025s user 0m0.024s sys 0m0.000s Now using the character range is only one order of magnitude slower, instead of three. In grep 2.6.3, the effect seems to depend linearly on the number of characters in the range. In 2.7.43-ed23d, there seems to be the same slowdown as a result of using any range, regardless of whether it contains dozens of characters or just three, two, or one: schoen@sescenties:~/grep-2.7.43-ed23d/src$ time ./grep '[X-Z]' /usr/share/dict/words > /dev/null real 0m0.428s user 0m0.388s sys 0m0.012s schoen@sescenties:~/grep-2.7.43-ed23d/src$ time ./grep '[XYZ]' /usr/share/dict/words > /dev/null real 0m0.032s user 0m0.012s sys 0m0.004s The problem affects egrep as well as grep. The time taken in each case seems to be proportional to the amount of input searched, so the difference is _not_ just in the initial setup. I don't know anything about the implementation here, but it seems unnecessary for character ranges to be so inefficient, since a preprocessor could easily turn each range in a bracket expression into an explicit list of characters... -- Seth David Schoen <[email protected]> | Qué empresa fácil no pensar en http://www.loyalty.org/~schoen/ | un tigre, reflexioné. http://vitanuova.loyalty.org/ | -- Borges, El Zahir
