Re: Bracket expressions with character ranges are slow

Seth David Schoen Wed, 18 May 2011 10:44:36 -0700

Paolo Bonzini writes:

> On 05/09/2011 12:58 AM, Seth David Schoen wrote:
> >Thanks, that's definitely the source of the problem.  I appreciate
> >the explanation.  I did some more tests with this and found that
> >searches with bracket expressions in my UTF-8 locale are slow when
> >the elements inside the brackets contain both a single-byte character
> >and a multi-byte character.  So [ab], [üçå], [美国], and [ł天] are all
> >fast, but [人a] and [aö] are quite slow.
> >
> >Maybe I need to think more about how UTF-8 works, but I don't quite
> >see why these bracket expressions need to be as slow as they are.
> 
> You are correct that these cases (unlike ranges) can be optimized.


Suppose grep had a preprocessor that converted any bracket
expression containing elements of different byte sizes, whether
[美国a] or a range not all of whose characters are a single byte,
into a parenthesized alternation like (美|国|a).  Would this use
more memory, constituting a space-for-time tradeoff?  If not, is
there some other reason not to do this?  Is there some other
case in which matching the alternation becomes less efficient than
matching the bracket expression?

I realize this would be possible only for egrep, at least at the
surface level of rewriting the regular expression, since basic
regular expressions have no standard alternation operator.
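
A minimal sketch of the rewrite proposed above, in Python rather than
inside grep, just to make the transformation concrete.  The function
name rewrite_bracket and its restriction to plain literal members (no
ranges, negation, or character classes) are assumptions made for
illustration; nothing here reflects how grep is actually implemented.

```python
# Hypothetical preprocessor: turn a bracket expression whose members
# have different UTF-8 byte lengths into a parenthesized alternation.
# Assumption: `members` holds only literal characters (no ranges,
# no negation, no [:classes:]).

METACHARS = set("\\[]()|^$.*+?{}-")


def utf8_len(ch: str) -> int:
    """Number of bytes the character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


def rewrite_bracket(members: str) -> str:
    """Return '[...]' if all members share one byte length, else '(a|b|...)'."""
    if not members:
        raise ValueError("empty bracket expression")
    if any(ch in METACHARS for ch in members):
        raise ValueError("only plain literal members are handled in this sketch")

    # Drop duplicates while keeping the original order.
    chars = list(dict.fromkeys(members))

    if len({utf8_len(ch) for ch in chars}) == 1:
        # Uniform byte length: leave the bracket expression alone.
        return "[" + "".join(chars) + "]"

    # Mixed byte lengths: emit an ERE alternation instead.
    return "(" + "|".join(chars) + ")"


if __name__ == "__main__":
    for members in ["ab", "美国", "人a", "美国a"]:
        print(f"[{members}] -> {rewrite_bracket(members)}")
```

With these inputs, [ab] and [美国] come back unchanged, while [人a]
becomes (人|a) and [美国a] becomes (美|国|a).  Note that this follows
the "different byte sizes" criterion literally, so it would also
rewrite [ł天], which was reported as fast above; a real preprocessor
might test only for a mix of single-byte and multi-byte members.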

-- 
Seth David Schoen <[email protected]> | Qué empresa fácil no pensar en
     http://www.loyalty.org/~schoen/   | un tigre, reflexioné.
     http://vitanuova.loyalty.org/     |            -- Borges, El Zahir
