Jim Meyering writes:

> Thanks for the detailed report.
> I presume you are using a multi-byte locale like en_US.utf8.
> Note that A-Z and ABCDEFGHIJKLMNOPQRSTUVWXYZ are not equivalent.
> The A-Z range may contain many other characters, including lower case letters.
> If your search strings and data are all single-byte,
> you may prefer to use a single-byte locale.
> If you set LC_CTYPE=C in your environment, all of the above
> will run quickly.
>
> That would avoid the inherent expense of using a UTF-8 locale.
> When you use the C locale, grep can take advantage of the
> simpler locale and works more like you would expect:

Thanks, that's definitely the source of the problem. I appreciate
the explanation.

I did some more tests with this and found that searches with bracket
expressions in my UTF-8 locale are slow when the elements inside the
brackets include both a single-byte character and a multi-byte
character. So [ab], [üçå], [美国], and [ł天] are all fast, but [人a]
and [aö] are quite slow.

Maybe I need to think more about how UTF-8 works, but I don't quite
see why these bracket expressions need to be as slow as they are.
In particular, egrep REs using | are quite fast even when the number
of bytes in the expressions on either side of the | is different, like

egrep '(mew|meow)'

If I manually rewrite the bracket expressions to use the vertical
pipe, they get fast again even if the number of bytes in the characters
varies:

schoen@sescenties:~$ time egrep '[人a]' /usr/share/dict/words > /dev/null

real    1m7.780s
user    1m7.164s
sys     0m0.028s
schoen@sescenties:~$ time egrep '(人|a)' /usr/share/dict/words > /dev/null

real    0m0.048s
user    0m0.024s
sys     0m0.000s

The same difference appears for [öa] and (ö|a).

(It's true that my /usr/share/dict/words doesn't contain 人, but I
wrote a program to add 人 and a few other multibyte characters to
random positions within 1/10 of the words in /usr/share/dict/words,
and egrep's behavior seemed correct and fast when running over that
file too.)

I think [人a] and (人|a) specify equivalent patterns, so I don't see
why one of them has to be much slower than the other. Wouldn't it be
possible to rewrite the former into the latter?

-- 
Seth David Schoen <[email protected]> | Qué empresa fácil no pensar en
     http://www.loyalty.org/~schoen/   | un tigre, reflexioné.
     http://vitanuova.loyalty.org/     |            -- Borges, El Zahir
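
The rewrite asked about above, from [人a] to (人|a), can be sketched mechanically for the simplest case. What follows is only an illustrative Python sketch under stated assumptions, not a description of anything grep does internally. The function name bracket_to_alternation is hypothetical, and the sketch assumes the bracket holds nothing but literal characters: no ranges such as A-Z, no classes such as [:alpha:], and no negation. Those cases depend on the locale and are rejected here rather than guessed at.

```python
# ERE metacharacters that are literal inside a bracket expression but
# special outside it, so they need a backslash in the alternation form.
ERE_SPECIAL = set(".[\\()*+?{|^$")


def bracket_to_alternation(bracket: str) -> str:
    """Rewrite a literal-only bracket expression like '[人a]' as '(人|a)'.

    Assumption: members are plain literal characters. Ranges, classes,
    and negation are locale-dependent and are refused.
    """
    if len(bracket) < 3 or not (bracket.startswith("[") and bracket.endswith("]")):
        raise ValueError(f"not a bracket expression: {bracket!r}")

    members = bracket[1:-1]
    if members.startswith("^") or any(ch in members for ch in "-[]"):
        raise ValueError("only plain literal members are supported")

    # One alternative per character (code point), escaped where needed.
    alternatives = ["\\" + ch if ch in ERE_SPECIAL else ch for ch in members]
    return "(" + "|".join(alternatives) + ")"


if __name__ == "__main__":
    for expr in ("[人a]", "[öa]", "[a.]"):
        print(expr, "->", bracket_to_alternation(expr))
```

The output of this function could be passed to egrep in place of the original bracket expression to repeat the timing comparison for other character combinations.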
