bug#16631: Consideration of title case on case-insensitive matching

Norihiro Tanaka Fri, 07 Feb 2014 08:52:03 -0800

Paul Eggert wrote:
> 1. It doesn't solve the problem from the ordinary user's point of view.
> For example, "echo lj | LC_ALL=en_US.UTF-8 src/grep -i ?" will still
> output nothing, because the one-character pattern "?" does not match
> the two-character string "lj" even when the latter's two-letter case
> variants "Lj", "lJ", "LJ" are considered.
> 
> 2. The characters in question are present in Unicode only for
> compatibility with previous standards; they're not intended to be used
> in new text. So this is a problem of the past, one that has mostly died
> out already.
> 
> 3. Because of (2) the characters in question are rare, even in the
> languages where one might naively think they're useful. For example,
> the Croatian Wikipedia page for Ljubljana 
> <http://hr.wikipedia.org/wiki/Ljubljana>
> consistently uses the two-character forms "Lj" and "lj", not the
> one-character forms "ǈ" and "ǉ".
> 
> 4. The solution doesn't generalize to similar problems in more-complicated
> orthographies. For example, in polytonic Greek when ignoring case
> ordinary users would expect "ᾄ" (U+1F84) to match not only "ᾌ" (U+1F8C),
> but also "Α" (U+0391), "ΑΙ" (U+0391, U+0399; two characters) and "Αι"
> (U+0391, U+03B9). Worse, this depends on context: often "?" should
> not match "??" when ignoring case. For details on this, please see
> Nick Nicholas's discussion "Titlecase and Adscripts"
> <http://www.tlg.uci.edu/~opoudjis/unicode/unicode_adscript.html>.
> 
> I think that it's because the problem is glibc doesn't define conversion
> between two-character string "lj" and single-character Lj, "ᾌ" (U+1F8C)
> and "Α" (U+0391) etc.

For example, grep on HP-UX, which appears to be quite compliant with
POSIX, supports conversion between single-character "lj" and
single-character "Lj", though it doesn't support the conversions above.

I believe the conversion rules are required to comply with the locale
data of libc.  It looks like the conversion between "Lj", "lJ" and "LJ"
is defined in UTF-8, but not between U+1F84 and U+0391 etc.

> 5. When POSIX specifies how to match a regular expression while ignoring 
> case, it talks only about "uppercase or lowercase" 
> <http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_02>.
>  
> If we change 'grep' along the lines being suggested, we'll either have 
> to change POSIX, or have the change take effect only if POSIXLY_CORRECT 
> is not set.

The upper case of single-character "Lj" is "LJ" and the lower case is "lj".
These conversions are also supported by the towupper and towlower functions.
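
As a rough illustration of the point above, here is a minimal sketch of
how a matcher could collect the case variants of one wide character using
only towupper and towlower.  The helper name case_variants is hypothetical
and is not taken from grep, gawk or dfa.c; the results depend entirely on
the LC_CTYPE data of the current locale.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper, not part of grep or gawk.
   Store in OUT the distinct case variants of WC according to the
   current locale: WC itself, then its upper case and its lower case
   if they differ.  Return the number of entries stored (1 to 3).  */
static size_t
case_variants (wint_t wc, wint_t out[3])
{
  wint_t up = towupper (wc);
  wint_t lo = towlower (wc);
  size_t n = 0;

  assert (out != NULL);

  out[n++] = wc;
  if (up != wc)
    out[n++] = up;
  if (lo != wc && lo != up)
    out[n++] = lo;
  return n;
}
```

For a title-case character such as U+01C8, and assuming setlocale has
selected a UTF-8 locale whose data defines both mappings, this should
yield three entries (the character itself, its upper case and its lower
case), whereas an ordinary letter yields at most two.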

Aharon Robbins wrote:
> This is an issue for gawk.

It seems that I misunderstood.  The problem doesn't reproduce on
grep-2.16.  It is caused by the patch for bug#16421, which removes the
GREP-oriented code from dfa.c.
