bug#16232: [PATCH] grep: make --ignore-case (-i) faster (sometimes 10x) in multibyte locales

Jim Meyering Sat, 11 Jan 2014 09:59:10 -0800

On Sat, Jan 11, 2014 at 6:15 AM, Pádraig Brady <[email protected]> wrote:
> On 01/11/2014 11:33 AM, Pádraig Brady wrote:
>> On 01/11/2014 05:40 AM, Jim Meyering wrote:
>>> On Fri, Jan 10, 2014 at 8:52 PM, Jim Meyering <[email protected]> wrote:
>>>>> I wonder might this faster path be restricted to a safer but very common 
>>>>> input subset of:
>>>>>
>>>>> (MB_CUR_MAX == 1 || (in_utf8 && *c < 0x80))
>>>>
>>>> That sounds like a good approach.
>>>> Now I need another test case, to demonstrate that the current code can
>>>> cause trouble.
>>>
>>> Hmm... after thinking about this for a while and actually trying to
>>> break the current code (did not find a way to demonstrate a regression),
>>> I have concluded that the current approach is no worse than the prior
>>> one of matching a case-mapped regexp vs. each case-mapped input line.
>>>
>>> That's not to say that it's perfect, of course.
>>> The "LATIN SMALL LETTER J WITH CARON, COMBINING DOT BELOW" example
>>> from gnulib's test-ulc-casecmp.c is a great example: this matches:
>>>
>>>     printf '\x6A\xCC\x8C\xCC\xA3\n'|src/grep -i "$(printf '\x6A\xCC\x8C\xCC\xA3')"
>>>
>>> but this does not, yet probably should:
>>>
>>>     printf '\xC7\xB0\xCC\xA3\n'|src/grep -i "$(printf '\x6A\xCC\x8C\xCC\xA3')"
>>>
>>> Can you see a way to demonstrate a regression?
>>
>> Oh right, it doesn't handle these cases already.
>> Fair enough I don't see a regression then.
>
> This is also a good summary of stuff to consider with case:
> http://www.unicode.org/faq/casemap_charprop.html
>
> So picking another case situation from there:
>   "in the Greek script, capital sigma (U+03A3) is the uppercase form of both
>    the regular (U+03C3) and final (U+03C2) lowercase sigma."
>
> One can see that sed handles this:
>   $ printf '\u03C2\u03C3\n' | sed 's/.*/&\U&/'
>   ςσΣΣ
>   $ printf '\u03A3\n' | sed 's/.*/&\L&/'
>   Σσ
>
> Though I was surprised that grep (2.14) didn't match any combination of these:
>   $ printf '\u03C2\u03C3\n' | grep -Fi "$(printf \u03A3)"
>   $ printf '\u03A3\n' | grep -Fi "$(printf \u03C2)"
>   $ printf '\u03A3\n' | grep -Fi "$(printf \u03C3)"
>
> Not a regression of course.


Thank you for the reference and the fine examples.
I'll add the latter as a known-failing test.
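
For illustration only, the sigma checks quoted above could be wrapped in a
small script along these lines. This is a rough sketch, not the test that
was added to grep: the helper name check_pair and the locale name
en_US.UTF-8 are assumptions, and the characters are written as octal UTF-8
bytes so that any POSIX printf can produce them. It only reports what grep
does; it makes no claim about which pairs match.

```shell
#!/bin/sh
# Hypothetical sketch of a sigma case-folding check; not grep's real test.

# Assumed locale name; if it is not installed, grep falls back to the C locale.
utf8_locale=en_US.UTF-8

# UTF-8 bytes as octal escapes (portable across POSIX printf implementations).
capital=$(printf '\316\243')   # U+03A3 GREEK CAPITAL LETTER SIGMA
small=$(printf '\317\203')     # U+03C3 GREEK SMALL LETTER SIGMA
final=$(printf '\317\202')     # U+03C2 GREEK SMALL LETTER FINAL SIGMA

# check_pair INPUT PATTERN
# Report whether "grep -Fi PATTERN" finds a match in the single line INPUT.
check_pair () {
  if printf '%s\n' "$1" | LC_ALL=$utf8_locale grep -Fi -- "$2" >/dev/null 2>&1
  then
    echo "match: input=$1 pattern=$2"
  else
    echo "no match: input=$1 pattern=$2"
  fi
}

# The three combinations from the quoted message.
check_pair "$final$small" "$capital"
check_pair "$capital" "$final"
check_pair "$capital" "$small"
```

A known-failing test would turn the "no match" branch into an expected
failure rather than a printed line; how that is wired up depends on the
test harness in use.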
