bug#16232: [PATCH] grep: make --ignore-case (-i) faster (sometimes 10x) in multibyte locales

Jim Meyering Mon, 23 Dec 2013 15:55:53 -0800

On Mon, Dec 23, 2013 at 3:30 PM, Eric Blake <[email protected]> wrote:
> On 12/23/2013 04:12 PM, Jim Meyering wrote:
>> Did you miss the "isascii" check in the new trivial_case_convert function?
>
> No.  But even with that check in place:
>
>> If you can describe circumstances in which the new patch malfunctions,
>> please do,
>> but everything you wrote seems to rely on a false assumption.
>
> No, it's a quite real complaint - your patch is broken for tr_TR.
>
>> E.g., your turkish-I example works fine with my patch.
>
> isascii('i') is true, but converting 'i' to '[iI]' is incorrect in the
> tr_TR locale.  Rather, the conversion must be to '[iİ]'; similarly, 'I'
> would be translated to '[Iı]'.  Neither of those conversions fit in 4
> bytes (since dotted-capital-I and dotless-lower-i are both multi-byte
> characters).
>
> Need help easily finding those characters on a non-Turkish keyboard?  I
> used:
> $ echo iI | LC_ALL=tr_TR.UTF-8 sed 's/\(.\)\(.\)/\U\1\L\2/'
>
> At any rate, prior to your patch, lower dotless i in the buffer gives an
> insensitive match to upper dotless I in the pattern:
>
> $ echo ı | LC_ALL=tr_TR.UTF-8 grep -i I || echo no match
> ı
>
> After your patch:
>
> $ echo ı | LC_ALL=tr_TR.UTF-8 src/grep -i I || echo no match
> no match
>
> Oops, you failed to match lower dotless i insensitively against upper
> dotless I, because upper dotless I is ascii, but you incorrectly
> converted it into the wrong pattern.


Thanks for dotting those 'i's.  While there is no risk of buffer
overrun, there would definitely be a problem with the tr_TR locale.
I will resolve it by removing the isascii check and performing
multibyte case conversion to form each [cC] pair.  Of course,
that will mean removing the "4 * byte-length-of-search-string"
buffer size limitation.

I will also add tests based on your examples.

bug#16232: [PATCH] grep: make --ignore-case (-i) faster (sometimes 10x) in multibyte locales

Reply via email to