Re: [PATCH] grep -i: work also when converting to lower-case inflates byte count

Paolo Bonzini Sat, 23 Jun 2012 08:06:44 -0700

>> Turkish lowercase i-with-dot is shorter than the uppercase, and
>> uppercase I-without-dot is shorter than the lowercase.
>
> Thanks!  With that, I created a test case and watched it malfunction.
> That demonstrated a couple of invalid (in that case) assumptions in the
> new code.  I suspect that this bug strikes only in relatively few locales.


Yes, my recollection is that it happens only in Turkish and Azeri
locales.  There is more strangeness with case mappings in Lithuanian
locales (an accented i retains its dot, or something like that!),
but I think glibc doesn't implement that.
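
For anyone who wants to see the size change concretely, here is a
minimal sketch that measures the UTF-8 byte length of each side of the
two Turkish/Azeri case pairs.  The pairs are hard-coded (Python's
str.lower() is locale-independent and does not apply these rules), and
the names TURKISH_CASE_PAIRS, utf8_len and byte_delta_on_lowercase are
made up for this illustration; none of this is taken from the grep
patch itself.

```python
# Turkish/Azeri case pairs as (upper, lower).  Hard-coded on purpose:
# Python's str.lower() ignores the locale, so it cannot be used to
# derive these mappings.
TURKISH_CASE_PAIRS = [
    ("\u0130", "i"),  # capital I with dot above -> ordinary small i
    ("I", "\u0131"),  # ordinary capital I -> small dotless i
]


def utf8_len(ch: str) -> int:
    """Return the number of bytes ch occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


def byte_delta_on_lowercase(upper: str, lower: str) -> int:
    """Return how many bytes the text grows (positive) or shrinks
    (negative) when upper is replaced by lower."""
    return utf8_len(lower) - utf8_len(upper)


if __name__ == "__main__":
    for upper, lower in TURKISH_CASE_PAIRS:
        print(
            f"U+{ord(upper):04X} ({utf8_len(upper)} bytes) -> "
            f"U+{ord(lower):04X} ({utf8_len(lower)} bytes), "
            f"delta {byte_delta_on_lowercase(upper, lower):+d}"
        )
```

The second pair is the one that matters for the subject line: the
lower-cased text is one byte longer than the input, so a buffer sized
for the original line is too small, and byte offsets into the converted
text no longer line up with the original.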

Anyhow, thanks for the work on this longstanding bug!  I think this
was the last regression that was introduced in 2.6, so this is a major
achievement in the 2.x series (perhaps we should have called the 2.6
release 3.0).

The next step would be to add support for Unicode character classes,
and look into converting other multibyte locales to/from UTF-8 in
order to speed up the matches.

Paolo
