On Sat, Jan 11, 2014 at 6:15 AM, Pádraig Brady <[email protected]> wrote:
> On 01/11/2014 11:33 AM, Pádraig Brady wrote:
>> On 01/11/2014 05:40 AM, Jim Meyering wrote:
>>> On Fri, Jan 10, 2014 at 8:52 PM, Jim Meyering <[email protected]> wrote:
>>>>> I wonder might this faster path be restricted to a safer but very common
>>>>> input subset of:
>>>>>
>>>>>   (MB_CUR_MAX == 1 || (in_utf8 && *c < 0x80))
>>>>
>>>> That sounds like a good approach.
>>>> Now I need another test case, to demonstrate that the current code can
>>>> cause trouble.
>>>
>>> Hmm... after thinking about this for a while and actually trying to
>>> break the current code (did not find a way to demonstrate a regression),
>>> I have concluded that the current approach is no worse than the prior
>>> one of matching a case-mapped regexp vs. each case-mapped input line.
>>>
>>> That's not to say that it's perfect, of course.
>>> The "LATIN SMALL LETTER J WITH CARON, COMBINING DOT BELOW" example
>>> from gnulib's test-ulc-casecmp.c is a great example: this matches:
>>>
>>>   printf '\x6A\xCC\x8C\xCC\xA3\n'|src/grep -i "$(printf '\x6A\xCC\x8C\xCC\xA3')"
>>>
>>> but this does not, yet probably should:
>>>
>>>   printf '\xC7\xB0\xCC\xA3\n'|src/grep -i "$(printf '\x6A\xCC\x8C\xCC\xA3')"
>>>
>>> Can you see a way to demonstrate a regression?
>>
>> Oh right, it doesn't handle these cases already.
>> Fair enough I don't see a regression then.
>
> This is also a good summary of stuff to consider with case:
> http://www.unicode.org/faq/casemap_charprop.html
>
> So picking another case situation from there:
> "in the Greek script, capital sigma (U+03A3) is the uppercase form of both
> the regular (U+03C3) and final (U+03C2) lowercase sigma."
>
> One can see that sed handles this:
>   $ printf '\u03C2\u03C3\n' | sed 's/.*/&\U&/'
>   ςσΣΣ
>   $ printf '\u03A3\n' | sed 's/.*/&\L&/'
>   Σσ
>
> Though I was surprised the grep (2.14) didn't match any combo of these
>   $ printf '\u03C2\u03C3\n' | grep -Fi "$(printf '\u03A3')"
>   $ printf '\u03A3\n' | grep -Fi "$(printf '\u03C2')"
>   $ printf '\u03A3\n' | grep -Fi "$(printf '\u03C3')"
>
> Not a regression of course.

Thank you for the reference and the fine examples. I'll add the latter as a known-failing test.
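
For reference, here is a rough Python sketch of what a case-insensitive comparison would be expected to report for the inputs above. It assumes Unicode case folding (Python's str.casefold) plus NFD normalization as the model of "should match"; it is not how grep implements -i, and the helper names are made up for illustration.

```python
import unicodedata

CAPITAL_SIGMA = "\u03A3"  # Σ GREEK CAPITAL LETTER SIGMA
SMALL_SIGMA = "\u03C3"    # σ GREEK SMALL LETTER SIGMA
FINAL_SIGMA = "\u03C2"    # ς GREEK SMALL LETTER FINAL SIGMA


def caseless_equal(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings after Unicode case folding."""
    return a.casefold() == b.casefold()


def canonical_caseless_equal(a: str, b: str) -> bool:
    """Hypothetical helper: case-fold and normalize to NFD before comparing,
    so canonically equivalent sequences compare equal."""
    def key(s: str) -> str:
        folded = unicodedata.normalize("NFD", s).casefold()
        return unicodedata.normalize("NFD", folded)
    return key(a) == key(b)


# The two spellings from the gnulib test-ulc-casecmp.c example.
PRECOMPOSED = "\u01F0\u0323"   # j-with-caron (one code point) + dot below
DECOMPOSED = "j\u030C\u0323"   # j + combining caron + combining dot below
```

Under this model, caseless_equal() is true for any pairing of the three sigma forms, and canonical_caseless_equal(PRECOMPOSED, DECOMPOSED) is true as well, which is the behavior the known-failing tests would be asking of grep -i.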
