Bracket expressions with character ranges are slow

Seth David Schoen Sat, 07 May 2011 22:24:02 -0700

Hi,

In grep 2.6.3 from Ubuntu Maverick, I noticed that patterns using
a character range in a bracket expression are extraordinarily slow.
E.g.


schoen@sescenties:~$ time grep '[A-Z]' /usr/share/dict/words > /dev/null

real    0m26.819s
user    0m26.134s
sys     0m0.012s
schoen@sescenties:~$ time grep -v '[A-Z]' /usr/share/dict/words > /dev/null

real    0m27.438s
user    0m26.130s
sys     0m0.008s

(I first noticed this problem when trying to make a word list that
excludes proper names for use in an anagram game.)

What's really odd about this is that the same pattern written without a
range, using an explicit list of characters, is dramatically faster:

schoen@sescenties:~$ time grep '[ABCDEFGHIJKLMNOPQRSTUVWXYZ]'
/usr/share/dict/words > /dev/null

real    0m0.020s
user    0m0.012s
sys     0m0.004s
schoen@sescenties:~$ time grep -v '[ABCDEFGHIJKLMNOPQRSTUVWXYZ]'
/usr/share/dict/words > /dev/null

real    0m0.033s
user    0m0.032s
sys     0m0.000s

That is, using the explicit list of characters in the backet
expression was three orders of magnitude faster!  Using the
character class [:upper:] is also fast:

schoen@sescenties:~$ time grep  '[:upper:]' /usr/share/dict/words > /dev/null

real    0m0.026s
user    0m0.024s
sys     0m0.000s

The disparity in the snapshot version released today, 2.7.43-ed23d,
is much less extreme, but it's still present:

schoen@sescenties:~/grep-2.7.43-ed23d/src$ time ./grep '[A-Z]' 
/usr/share/dict/words > /dev/null

real    0m0.417s
user    0m0.372s
sys     0m0.016s
schoen@sescenties:~/grep-2.7.43-ed23d/src$ time ./grep 
'[ABCDEFGHIJKLMNOPQRSTUVWXYZ]' /usr/share/dict/words > /dev/null

real    0m0.025s
user    0m0.024s
sys     0m0.000s

Now using the character range is only one order of magnitude slower,
instead of three.

In grep 2.6.3, the effect seems to depend linearly on the number of
characters in the range.  In 2.7.43-ed23d, there seems to be the
same slowdown as a result of using any range, regardless of whether
it contains dozens of characters or just three, two, or one:

schoen@sescenties:~/grep-2.7.43-ed23d/src$ time ./grep '[X-Z]'
/usr/share/dict/words > /dev/null

real    0m0.428s
user    0m0.388s
sys     0m0.012s
schoen@sescenties:~/grep-2.7.43-ed23d/src$ time ./grep '[XYZ]'
/usr/share/dict/words > /dev/null

real    0m0.032s
user    0m0.012s
sys     0m0.004s

The problem affects egrep as well as grep.

The time taken in each case seems to be proportional to the amount
of input searched, so the difference is _not_ just in the initial
setup.  I don't know anything about the implementation here, but
it seems unnecessary for character ranges to be so inefficient,
since a preprocessor could easily turn each range in a bracket
expression into an explicit list of characters...

-- 
Seth David Schoen <[email protected]> | Qué empresa fácil no pensar en
     http://www.loyalty.org/~schoen/   | un tigre, reflexioné.
     http://vitanuova.loyalty.org/     |            -- Borges, El Zahir

Bracket expressions with character ranges are slow

Reply via email to