Paul Eggert wrote:
> 1. It doesn't solve the problem from the ordinary user's point of view.
> For example, "echo lj | LC_ALL=en_US.UTF-8 src/grep -i ?" will still
> output nothing, because the one-character pattern "?" does not match
> the two-character string "lj" even when the latter's two-letter case
> variants "Lj", "lJ", "LJ" are considered.
>
> 2. The characters in question are present in Unicode only for
> compatibility with previous standards; they're not intended to be used
> in new text. So this is a problem of the past, one that has mostly died
> out already.
>
> 3. Because of (2) the characters in question are rare, even in the
> languages where one might naively think they're useful. For example,
> the Croatian Wikipedia page for Ljubljana
> <http://hr.wikipedia.org/wiki/Ljubljana>
> consistently uses the two-character forms "Lj" and "lj", not the
> one-character forms "ǈ" and "ǉ".
>
> 4. The solution doesn't generalize to similar problems in more-complicated
> orthographies. For example, in polytonic Greek when ignoring case
> ordinary users would expect "ᾄ" (U+1F84) to match not only "ᾌ" (U+1F8C),
> but also "Α" (U+0391), "ΑΙ" (U+0391, U+0399; two characters) and "Αι"
> (U+0391, U+03B9). Worse, this depends on context: often "?" should
> not match "??" when ignoring case. For details on this, please see
> Nick Nicholas's discussion "Titlecase and Adscripts"
> <http://www.tlg.uci.edu/~opoudjis/unicode/unicode_adscript.html>.

I think the problem is that glibc doesn't define a conversion between the
two-character string "lj" and the single-character "Lj", or between
"ᾌ" (U+1F8C) and "Α" (U+0391), etc.
For example, grep on HP-UX, which looks to me quite compliant with POSIX,
supports conversion between the single-character "lj" and the
single-character "Lj", though it doesn't support the conversions above.
I believe the conversion rules are required to follow the locale data of
libc. It looks to me as though the conversion between "Lj", "lJ" and "LJ"
is defined in UTF-8, but not the one between U+1F84 and U+0391, etc.

> 5. When POSIX specifies how to match a regular expression while ignoring
> case, it talks only about "uppercase or lowercase"
> <http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_02>.
>
> If we change 'grep' along the lines being suggested, we'll either have
> to change POSIX, or have the change take effect only if POSIXLY_CORRECT
> is not set.

The upper case of the single-character "Lj" is "LJ" and the lower case is
"lj". These conversions are also supported by the towupper and towlower
functions.

Aharon Robbins wrote:
> This is an issue for gawk.

It seems that I misunderstood. The problem doesn't reproduce on grep-2.16.
It was introduced by the patch for bug#16421, which removes the
GREP-oriented code in dfa.c.
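
P.S. The single-character case mappings mentioned above can be checked with
a small sketch. This is only an analogue: it uses Python's built-in Unicode
case mappings, not the locale tables that towupper and towlower consult in
glibc or on HP-UX, and the helper names case_variants and
matches_ignoring_case are hypothetical, made up for this illustration.

```python
# Sketch only: Python's str methods follow the Unicode case mappings.
# They do not read the C locale data used by towupper/towlower, so
# results on a given libc may differ.

DIGRAPH_TITLE = "\u01c8"  # single-character "Lj" (titlecase digraph)


def case_variants(ch):
    """Return the (upper, lower) forms of a single character."""
    return ch.upper(), ch.lower()


def matches_ignoring_case(pattern, text):
    """Hypothetical helper: compare two strings after Unicode case folding."""
    return pattern.casefold() == text.casefold()


if __name__ == "__main__":
    upper, lower = case_variants(DIGRAPH_TITLE)
    # The single-character forms map to each other one-to-one.
    print("upper: U+%04X, lower: U+%04X" % (ord(upper), ord(lower)))
    # The single-character form does not fold to the two-character "lj".
    print(matches_ignoring_case(DIGRAPH_TITLE, "lj"))
```

Under these mappings the three single-character forms (U+01C7, U+01C8,
U+01C9) convert among themselves, but none of them folds to the
two-character string "lj", which is the case from point 1 of the quoted
message.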
