On 03/05/2014 07:11 AM, Norihiro Tanaka wrote:
I still believe that upper or lower case of a character should
also match title case

The (soon-to-be-fixed) gnulib regex code agrees with you, provided that towupper (X) returns the same value for all three case forms X of the character, because it compares (towupper (input) == towupper (pattern)). However, the most-plausible reading of POSIX does not agree with you: it would require (input == pattern || towlower (input) == pattern || towupper (input) == pattern), under which a titlecase pattern matches only itself.
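To make the difference concrete, here is a minimal sketch of the two comparison rules side by side. It is not the gnulib code itself. The helpers demo_towupper and demo_towlower are hypothetical stand-ins that hard-code the case mappings for one titlecase letter (U+01C4, U+01C5, U+01C6), so the result does not depend on the current locale; match_gnulib and match_posix are names invented for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <wchar.h>
#include <wctype.h>

/* The three case forms of one letter (the DZ-with-caron digraph). */
enum { DZ_UPPER = 0x01C4, DZ_TITLE = 0x01C5, DZ_LOWER = 0x01C6 };

/* Hypothetical stand-ins for towupper/towlower: the DZ mappings are
   hard-coded (an assumption made for this sketch) so that the example
   behaves the same in any locale; everything else is delegated.  */
static wint_t demo_towupper(wint_t c)
{
    if (c == DZ_TITLE || c == DZ_LOWER)
        return DZ_UPPER;
    return towupper(c);
}

static wint_t demo_towlower(wint_t c)
{
    if (c == DZ_UPPER || c == DZ_TITLE)
        return DZ_LOWER;
    return towlower(c);
}

/* The rule described above for gnulib: compare the uppercased forms. */
static bool match_gnulib(wint_t input, wint_t pattern)
{
    return demo_towupper(input) == demo_towupper(pattern);
}

/* The most-plausible POSIX reading: only the input is case-converted. */
static bool match_posix(wint_t input, wint_t pattern)
{
    return input == pattern
        || demo_towlower(input) == pattern
        || demo_towupper(input) == pattern;
}

/* Checks the claims made in the text for a titlecase pattern. */
static void check_titlecase_examples(void)
{
    /* gnulib rule: lower and upper input both match a titlecase pattern. */
    assert(match_gnulib(DZ_LOWER, DZ_TITLE));
    assert(match_gnulib(DZ_UPPER, DZ_TITLE));

    /* POSIX reading: a titlecase pattern matches only itself ... */
    assert(match_posix(DZ_TITLE, DZ_TITLE));
    assert(!match_posix(DZ_LOWER, DZ_TITLE));
    assert(!match_posix(DZ_UPPER, DZ_TITLE));

    /* ... although titlecase input still matches an uppercase pattern. */
    assert(match_posix(DZ_TITLE, DZ_UPPER));
}
```

Calling check_titlecase_examples () from a test program exercises both rules. The asymmetry in the last two groups of assertions is the reason the POSIX reading looks buggy.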

It seems pretty clear to me that the most-plausible reading of POSIX is buggy, for this reason. No wonder so many implementations fail to conform to it.

I thought of another way in which gnulib/glibc regex does not conform to POSIX, and here there doesn't seem to be any ambiguity about it. In the POSIX locale when ignoring case, POSIX says the pattern '[Z-a]' matches the data 'Z', 'z', 'A', 'a', and the nonalphabetic characters like '^' that collate between 'Z' and 'a'. But the glibc regex code rejects that pattern entirely. Conversely, in the same situation the glibc regex code says '[A-z]' matches only alphabetic characters, whereas POSIX says it should also match the nonalphabetic characters like '^' that collate between 'Z' and 'a'. It appears that nobody cares, as this incompatibility has been present for years and I don't recall anyone complaining. It is weird, though, that this means "grep PAT" can match some lines that "grep -i PAT" doesn't.
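As a rough sketch of the POSIX behavior described above (not of what glibc does), the following assumes the POSIX locale, where range endpoints collate by byte value, and treats a character as matching a range when the character itself, its lowercase form, or its uppercase form falls inside the range. The function names in_range and range_match_icase are invented for this illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>

/* True if c lies in [lo, hi] by byte value, which is the collation
   order assumed here for the POSIX locale.  */
static bool in_range(unsigned char c, unsigned char lo, unsigned char hi)
{
    return lo <= c && c <= hi;
}

/* Case-insensitive range match under the reading sketched above:
   try the character as given, then its lowercase and uppercase forms.  */
static bool range_match_icase(unsigned char c, unsigned char lo,
                              unsigned char hi)
{
    return in_range(c, lo, hi)
        || in_range((unsigned char) tolower(c), lo, hi)
        || in_range((unsigned char) toupper(c), lo, hi);
}
```

Under these assumptions, range_match_icase (c, 'Z', 'a') is true for 'Z', 'z', 'A', 'a' and '^', and false for 'b', while range_match_icase ('^', 'A', 'z') is also true, which is the case where glibc is said to differ.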

Here POSIX is not merely ambiguous; it clearly disagrees with common practice. It's not clear whether the bug is in POSIX or in the implementation.


