On 02/03/2014 08:20 AM, Norihiro Tanaka wrote:
>    echo 'LJ' | LC_ALL=en_US.UTF-8 grep -i Lj
>    echo 'Lj' | LC_ALL=en_US.UTF-8 grep -i LJ
>
> We expect that LJ and Lj are returned, respectively.  But both return
> nothing.

Both test cases worked for me. I expect that you meant the cases with single characters, as in "echo ǉ | LC_ALL=en_US.UTF-8 grep -i ǈ".

I have doubts about this patch, for several reasons.

1. It doesn't solve the problem from the ordinary user's point of view. For example, "echo lj | LC_ALL=en_US.UTF-8 src/grep -i ǈ" will still output nothing, because the one-character pattern "ǈ" (U+01C8) does not match the two-character string "lj" even when the latter's two-letter case variants "Lj", "lJ", "LJ" are considered.

2. The characters in question are present in Unicode only for compatibility with previous standards; they're not intended to be used in new text. So this is a problem of the past, one that has mostly died out already.

3. Because of (2) the characters in question are rare, even in the languages where one might naively think they're useful. For example, the Croatian Wikipedia page for Ljubljana <http://hr.wikipedia.org/wiki/Ljubljana> consistently uses the two-character forms "Lj" and "lj", not the one-character forms "ǈ" (U+01C8) and "ǉ" (U+01C9).

4. The solution doesn't generalize to similar problems in more-complicated orthographies. For example, in polytonic Greek when ignoring case ordinary users would expect "ᾄ" (U+1F84) to match not only "ᾌ" (U+1F8C), but also "Α" (U+0391), "ΑΙ" (U+0391, U+0399; two characters) and "Αι" (U+0391, U+03B9). Worse, this depends on context: often "ᾄ" should not match "Αι" when ignoring case. For details on this, please see Nick Nicholas's discussion "Titlecase and Adscripts" <http://www.tlg.uci.edu/~opoudjis/unicode/unicode_adscript.html>.

5. When POSIX specifies how to match a regular expression while ignoring case, it talks only about "uppercase or lowercase" <http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_02>. If we change 'grep' along the lines being suggested, we'll either have to change POSIX, or have the change take effect only if POSIXLY_CORRECT is not set.

Taking all this into consideration, it sounds like we should let sleeping dogs lie, i.e., that dfa.c should do only the minimal work needed to support traditional case-insensitive matching a la POSIX.
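
To make points (1) and (4) concrete, here is a rough sketch in Python. It is only an illustration and not how dfa.c works: it uses Python's built-in Unicode case mappings as a stand-in, and the helpers case_variants and simple_ci_equal are hypothetical names invented for this example. It compares strings one character at a time, treating a character's single-code-point upper, lower and title forms as equivalent.

```python
import unicodedata

LJ_UPPER, LJ_TITLE, LJ_LOWER = "\u01c7", "\u01c8", "\u01c9"  # Ǉ, ǈ, ǉ
ALPHA_SMALL = "\u1f84"  # ᾄ
ALPHA_TITLE = "\u1f8c"  # ᾌ

def case_variants(ch):
    """Single-code-point case variants of one character (hypothetical helper)."""
    variants = {ch}
    for mapped in (ch.lower(), ch.upper(), ch.title()):
        if len(mapped) == 1:  # skip mappings that expand to several characters
            variants.add(mapped)
    return variants

def simple_ci_equal(pattern, text):
    """Character-by-character comparison; lengths must agree (hypothetical helper)."""
    return len(pattern) == len(text) and all(
        t in case_variants(p) for p, t in zip(pattern, text))

# Point 1: the one-character forms match each other, but never the
# two-character spelling, whatever case mapping is applied.
print(simple_ci_equal(LJ_TITLE, LJ_LOWER))  # True
print(simple_ci_equal(LJ_TITLE, "lj"))      # False

# The link between "ǈ" and "Lj" is a compatibility mapping, not a case mapping.
print(unicodedata.normalize("NFKC", LJ_TITLE) == "Lj")  # True

# Point 4: the uppercase form of "ᾄ" expands to two characters, so a
# per-character comparison cannot reach the adscript spellings.
print(len(ALPHA_SMALL.upper()))                       # 2
print(simple_ci_equal(ALPHA_SMALL, ALPHA_TITLE))      # True
print(simple_ci_equal(ALPHA_SMALL, "\u0391\u03b9"))   # False
```

Getting the last comparison to succeed would need multi-character and context-dependent rules, which is well beyond matching "uppercase or lowercase" variants of a single character.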


