Paolo Bonzini writes: > On 05/09/2011 12:58 AM, Seth David Schoen wrote: > >Thanks, that's definitely the source of the problem. I appreciate > >the explanation. I did some more tests with this and found that > >searches with bracket expressions in my UTF-8 locale are slow when > >the elements inside the brackets contain both a single-byte character > >and a multi-byte character. So [ab], [üçå], [美国], and [ł天] are all > >fast, but [人a] and [aö] are quite slow. > > > >Maybe I need to think more about how UTF-8 works, but I don't quite > >see why these bracket expressions need to be as slow as they are. > > You are correct that these cases (unlike ranges) can be optimized.
Suppose grep had a preprocessor that converted any bracket expression containing elements of different byte sizes, whether [美国a] or a range not all of whose characters are a single byte, into a parenthesized alternation like (美|国|a). Would this use more memory, constituting a space-for-time tradeoff? If not, is there some other reason not to do this? Is there some other case in which matching the alternation becomes less efficient than matching the bracket expression? I realize this is only potentially possible for egrep, at least at the surface level of rewriting the regular expression. -- Seth David Schoen <[email protected]> | Qué empresa fácil no pensar en http://www.loyalty.org/~schoen/ | un tigre, reflexioné. http://vitanuova.loyalty.org/ | -- Borges, El Zahir
