On 01/11/2014 05:40 AM, Jim Meyering wrote:
> On Fri, Jan 10, 2014 at 8:52 PM, Jim Meyering <[email protected]> wrote:
>>> I wonder might this faster path be restricted to a safer but very common
>>> input subset of:
>>>
>>>   (MB_CUR_MAX == 1 || (in_utf8 && *c < 0x80))
>>
>> That sounds like a good approach.
>> Now I need another test case, to demonstrate that the current code can
>> cause trouble.
>
> Hmm... after thinking about this for a while and actually trying to
> break the current code (did not find a way to demonstrate a regression),
> I have concluded that the current approach is no worse than the prior
> one of matching a case-mapped regexp vs. each case-mapped input line.
>
> That's not to say that it's perfect, of course.
> The "LATIN SMALL LETTER J WITH CARON, COMBINING DOT BELOW" example
> from gnulib's test-ulc-casecmp.c is a great example: this matches:
>
>   printf '\x6A\xCC\x8C\xCC\xA3\n'|src/grep -i "$(printf '\x6A\xCC\x8C\xCC\xA3')"
>
> but this does not, yet probably should:
>
>   printf '\xC7\xB0\xCC\xA3\n'|src/grep -i "$(printf '\x6A\xCC\x8C\xCC\xA3')"
>
> Can you see a way to demonstrate a regression?

Oh right, it doesn't handle these cases already.
Fair enough, I don't see a regression then. +1

Pádraig.
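
Not part of the thread: a minimal Python sketch of why the second quoted command "probably should" match. It decodes the two byte sequences from the printf commands and compares them after canonical decomposition (NFD) using the standard unicodedata module. The helper names code_points and canonically_equal are made up for illustration, and nothing here is meant to describe how grep itself compares text.

```python
import unicodedata

# Byte sequences copied from the two printf commands quoted above.
decomposed = b"\x6A\xCC\x8C\xCC\xA3".decode("utf-8")   # j + combining caron + combining dot below
precomposed = b"\xC7\xB0\xCC\xA3".decode("utf-8")      # j-with-caron + combining dot below


def code_points(s):
    """Return the code points of s as 'U+XXXX' strings (illustrative helper)."""
    return [f"U+{ord(ch):04X}" for ch in s]


def canonically_equal(a, b):
    """Compare two strings after canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


print(code_points(decomposed))
print(code_points(precomposed))
print("identical code points:", decomposed == precomposed)
print("canonically equivalent:", canonically_equal(decomposed, precomposed))
```

The two inputs differ as raw code point sequences but compare equal once both are normalized, which is the sense in which the pattern and the input line denote the same text.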
