On Mon, Dec 23, 2013 at 3:30 PM, Eric Blake <[email protected]> wrote: > On 12/23/2013 04:12 PM, Jim Meyering wrote: >> Did you miss the "isascii" check in the new trivial_case_convert function? > > No. But even with that check in place: > >> If you can describe circumstances in which the new patch malfunctions, >> please do, >> but everything you wrote seems to rely on a false assumption. > > No, it's a quite real complaint - your patch is broken for tr_TR. > >> E.g., your turkish-I example works fine with my patch. > > isascii('i') is true, but converting 'i' to '[iI]' is incorrect in the > tr_TR locale. Rather, the conversion must be to '[iİ]'; similarly, 'I' > would be translated to '[Iı]'. Neither of those conversions fit in 4 > bytes (since dotted-capital-I and dotless-lower-i are both multi-byte > characters). > > Need help easily finding those characters on a non-Turkish keyboard? I > used: > $ echo iI | LC_ALL=tr_TR.UTF-8 sed 's/\(.\)\(.\)/\U\1\L\2/' > > At any rate, prior to your patch, lower dotless i in the buffer gives an > insensitive match to upper dotless I in the pattern: > > $ echo ı | LC_ALL=tr_TR.UTF-8 grep -i I || echo no match > ı > > After your patch: > > $ echo ı | LC_ALL=tr_TR.UTF-8 src/grep -i I || echo no match > no match > > Oops, you failed to match lower dotless i insensitively against upper > dotless I, because upper dotless I is ascii, but you incorrectly > converted it into the wrong pattern.
Thanks for dotting those 'i's. While there is no risk of buffer overrun, there would definitely be a problem with the tr_TR locale. I will resolve it by removing the isascii check and performing multibyte case conversion to form each [cC] pair. Of course, that will mean removing the "4 * byte-length-of-search-string" buffer size limitation. I will also add tests based on your examples.
