bug#16919: [PATCH] fix mismatch between dfa and regex for treatment of titlecase

Paul Eggert Wed, 05 Mar 2014 10:51:45 -0800

On 03/05/2014 07:11 AM, Norihiro Tanaka wrote:

> I still believe that upper or lower case of a character should
> also match title case

The (soon-to-be-fixed) gnulib regex code agrees with you, assuming that towupper (X) agrees for all three values of X, because it uses (towupper (input) == towupper (pattern)). However, the most-plausible reading of POSIX does not agree with you, as it would require (input == pattern || towlower (input) == pattern || towupper (input) == pattern), which means a titlecase pattern will match only itself.
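To make the difference concrete, here is a minimal sketch of the two rules side by side. It is not the gnulib code: toy_towupper and toy_towlower are hypothetical stand-ins for towupper and towlower that know only the Unicode digraph U+01C4 (uppercase), U+01C5 (titlecase), U+01C6 (lowercase), so that the result does not depend on the locale in effect. The function names match_fold_upper and match_posix_reading are likewise made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <wchar.h>

/* The three case forms of one digraph: uppercase, titlecase, lowercase. */
static const wint_t DZ_UPPER = 0x01C4;
static const wint_t DZ_TITLE = 0x01C5;
static const wint_t DZ_LOWER = 0x01C6;

/* Hypothetical stand-in for towupper: maps only this digraph. */
static wint_t toy_towupper(wint_t c)
{
    return (c == DZ_TITLE || c == DZ_LOWER) ? DZ_UPPER : c;
}

/* Hypothetical stand-in for towlower: maps only this digraph. */
static wint_t toy_towlower(wint_t c)
{
    return (c == DZ_UPPER || c == DZ_TITLE) ? DZ_LOWER : c;
}

/* Rule described above for the gnulib regex code: fold both sides to upper. */
static bool match_fold_upper(wint_t input, wint_t pattern)
{
    return toy_towupper(input) == toy_towupper(pattern);
}

/* Rule from the most-plausible reading of POSIX: only the input is
   case-converted, and each converted form is compared to the raw pattern. */
static bool match_posix_reading(wint_t input, wint_t pattern)
{
    return input == pattern
        || toy_towlower(input) == pattern
        || toy_towupper(input) == pattern;
}

/* With a titlecase pattern, the two rules disagree on non-titlecase input. */
static void run_titlecase_examples(void)
{
    assert(match_fold_upper(DZ_LOWER, DZ_TITLE));
    assert(match_fold_upper(DZ_UPPER, DZ_TITLE));
    assert(!match_posix_reading(DZ_LOWER, DZ_TITLE));
    assert(!match_posix_reading(DZ_UPPER, DZ_TITLE));
    assert(match_posix_reading(DZ_TITLE, DZ_TITLE));
}
```

Under the second rule neither case conversion of the input can ever produce the titlecase character, which is why a titlecase pattern ends up matching only itself.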

It seems pretty clear to me that the most-plausible reading of POSIX is buggy, for this reason. No wonder so many implementations fail to conform to it.

I thought of a different way where gnulib/glibc regex does not conform to POSIX, and here there doesn't seem to be any ambiguity about it. In the POSIX locale when ignoring case, the pattern '[Z-a]' matches the data 'Z', 'z', 'A', 'a', and the nonalphabetic characters like '^' that collate between 'Z' and 'a'. But the glibc regex code rejects that pattern entirely. Conversely, in the same situation the glibc regex code says '[A-z]' matches only alphabetic characters, whereas POSIX says it should also match the nonalphabetic characters like '^' that collate between 'Z' and 'a'. It appears that nobody cares, as this incompatibility has been present for years and I don't recall anyone complaining. Though it is weird that this means "grep PAT" can match some lines that "grep -i PAT" doesn't.
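One way to observe this on a given system is to call the POSIX regcomp/regexec interface directly. The sketch below is only a probe, not part of any patch; try_pattern and check_baseline are hypothetical helper names. Since the program never calls setlocale, it runs in the "C" (POSIX) locale. The REG_ICASE results depend on the regex implementation and its version, so none are asserted here.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Returns 1 if DATA matches PAT, 0 if it does not,
   and -1 if PAT is rejected at compile time. */
static int try_pattern(const char *pat, const char *data, int cflags)
{
    regex_t re;
    if (regcomp(&re, pat, cflags | REG_NOSUB) != 0)
        return -1;
    int matched = regexec(&re, data, 0, NULL, 0) == 0;
    regfree(&re);
    return matched;
}

/* Sanity checks that stay clear of the disputed behavior. */
static void check_baseline(void)
{
    /* Case-sensitive: '^' collates between 'Z' and 'a' in the POSIX locale. */
    assert(try_pattern("[Z-a]", "^", 0) == 1);
    /* Ignoring case, a letter inside the range still matches. */
    assert(try_pattern("[A-z]", "a", REG_ICASE) == 1);
}
```

The calls of interest are then try_pattern("[Z-a]", "Z", REG_ICASE), where a return of -1 would correspond to the pattern being rejected, and try_pattern("[A-z]", "^", REG_ICASE), where 0 would correspond to the range matching only alphabetic characters.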

Here POSIX is not merely ambiguous, it's clearly disagreeing with common practice. It's not clear whether the bug is in POSIX or in the implementation.
