bug#16631: Consideration of title case on case-insensitive matching

Paul Eggert Thu, 06 Feb 2014 15:42:57 -0800

On 02/03/2014 08:20 AM, Norihiro Tanaka wrote:

> echo 'LJ' | LC_ALL=en_US.UTF-8 grep -i Lj
> echo 'Lj' | LC_ALL=en_US.UTF-8 grep -i LJ
>
> We expect that LJ and Lj are returned, respectively.  But both return
> nothing.

Both test cases worked for me. I expect that you meant the cases with single characters, as in "echo ǉ | LC_ALL=en_US.UTF-8 grep -i ǈ".
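For readers unfamiliar with these characters, here is a small Python sketch of the three single-character forms of "lj" (uppercase U+01C7, titlecase U+01C8, lowercase U+01C9). The helper upper_lower_variants is hypothetical and is not what grep or dfa.c does; it only illustrates why a titlecase character can fall outside a view of case that knows only "uppercase" and "lowercase".

```python
# Hypothetical helper, NOT grep's implementation: the set of characters a
# matcher would accept for one pattern character if it considered only the
# character itself plus its uppercase and lowercase counterparts.
def upper_lower_variants(ch):
    return {ch, ch.upper(), ch.lower()}


UPPER_LJ = "\u01c7"  # Ǉ  LATIN CAPITAL LETTER LJ
TITLE_LJ = "\u01c8"  # ǈ  LATIN CAPITAL LETTER L WITH SMALL LETTER J
LOWER_LJ = "\u01c9"  # ǉ  LATIN SMALL LETTER LJ

# Starting from the lowercase or uppercase form, the titlecase form is
# never reached by upper() or lower().
print(TITLE_LJ in upper_lower_variants(LOWER_LJ))  # False
print(TITLE_LJ in upper_lower_variants(UPPER_LJ))  # False

# Starting from the titlecase form, both other forms are reachable.
print(LOWER_LJ in upper_lower_variants(TITLE_LJ))  # True
print(UPPER_LJ in upper_lower_variants(TITLE_LJ))  # True
```

The asymmetry is the point: whether the titlecase form matches depends on which side the pattern is on, unless the matcher treats title case as a third variant.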


I have doubts about this patch, for several reasons.

1. It doesn't solve the problem from the ordinary user's point of view. For example, "echo lj | LC_ALL=en_US.UTF-8 src/grep -i ǈ" will still output nothing, because the one-character pattern "ǈ" does not match the two-character string "lj" even when the latter's two-letter case variants "Lj", "lJ", "LJ" are considered.

2. The characters in question are present in Unicode only for compatibility with previous standards; they're not intended to be used in new text. So this is a problem of the past, one that has mostly died out already.

3. Because of (2) the characters in question are rare, even in the languages where one might naively think they're useful. For example, the Croatian Wikipedia page for Ljubljana <http://hr.wikipedia.org/wiki/Ljubljana> consistently uses the two-character forms "Lj" and "lj", not the one-character forms "ǈ" and "ǉ".

4. The solution doesn't generalize to similar problems in more-complicated orthographies. For example, in polytonic Greek when ignoring case ordinary users would expect "ᾄ" (U+1F84) to match not only "ᾌ" (U+1F8C), but also "Α" (U+0391), "ΑΙ" (U+0391, U+0399; two characters) and "Αι" (U+0391, U+03B9). Worse, this depends on context: often "ᾄ" should not match "Αι" when ignoring case. For details on this, please see Nick Nicholas's discussion "Titlecase and Adscripts" <http://www.tlg.uci.edu/~opoudjis/unicode/unicode_adscript.html>.

5. When POSIX specifies how to match a regular expression while ignoring case, it talks only about "uppercase or lowercase" <http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_02>. If we change 'grep' along the lines being suggested, we'll either have to change POSIX, or have the change take effect only if POSIXLY_CORRECT is not set.
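Points 1, 2 and 4 can be checked against the Unicode data itself. The following Python sketch uses only the standard unicodedata module and the built-in string methods; the variable names are mine, and it illustrates the character properties only, not anything grep does.

```python
import unicodedata

TITLE_LJ = "\u01c8"  # ǈ, the one-character titlecase form

# The character has its own general category, "Lt" (titlecase letter),
# distinct from "Lu" (uppercase) and "Ll" (lowercase).
category = unicodedata.category(TITLE_LJ)
print(category)  # Lt

# Its compatibility decomposition is the ordinary two-character "Lj",
# consistent with the character being a compatibility character.
decomposed = unicodedata.normalize("NFKD", TITLE_LJ)
print(decomposed)  # Lj

# Point 1: the one-character form is not among the case variants of the
# two-character string "lj", so per-character case folding cannot help.
two_char_variants = {"lj", "Lj", "lJ", "LJ"}
one_char_matches = TITLE_LJ in two_char_variants
print(one_char_matches)  # False

# Point 4: in polytonic Greek the full uppercase mapping of U+1F84 is more
# than one code point long, while its titlecase mapping is one code point.
GREEK_LOWER = "\u1f84"  # ᾄ
greek_upper = GREEK_LOWER.upper()
greek_title = GREEK_LOWER.title()
print([hex(ord(c)) for c in greek_upper])
print([hex(ord(c)) for c in greek_title])
```

A matcher that works one character at a time has no natural place to put a mapping whose result is a different number of characters, which is why these cases do not fit the uppercase/lowercase model.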

Taking all this into consideration, it sounds like we should let sleeping dogs lie, i.e., that dfa.c should do the minimal work needed to support traditional case-insensitive matching a la POSIX.
