Hi again,

Odd names for collating elements
--------------------------------

I wrote:

> $ echo 'ch and more' | LANG=cy_GB.UTF-8 sed 's/[[.ch.]]/<MATCHED>/'
> sed: -e expression #1, char 21: Invalid collation character
>
> Odd, no?

It did seem odd, especially since the POSIX documentation uses
examples like this all the time (usually [.ch.] from pre-1994
Spanish).  For example [1]:

    collating-element <ch> from "<c><h>"
    collating-element <e-acute> from "<acute><e>"
    collating-element <ll> from "ll"

I was missing something obvious: in GNU locales, the collating
element has a hyphenated name.

> collating-symbol <zs>
> collating-element <z-s> from "<U007A><U0073>"

    $ echo 'ch and more' | LANG=cy_GB.UTF-8 sed 's/[[.c-h.]]/<MATCHED>/'
    <MATCHED> and more

So there’s the workaround.

I think this is a real bug: POSIX.1-2008 says [2]:

    A collating symbol is a collating element enclosed within
    bracket-period ( "[." and ".]" ) delimiters.  Collating elements
    are defined as described in Collation Order.  Conforming
    applications shall represent multi-character collating elements
    as collating symbols when it is necessary to distinguish them
    from a list of the individual characters that make up the
    multi-character collating element.  For example, if the string
    "ch" is a collating element defined using the line:

        collating-element <ch-digraph> from "<c><h>"

    in the locale definition, the expression "[[.ch.]]" shall be
    treated as an RE containing the collating symbol 'ch', while
    "[ch]" shall be treated as an RE matching 'c' or 'h'.  Collating
    symbols are recognized only inside bracket expressions.  If the
    string is not a collating element in the current locale, the
    expression is invalid.

In other words, in the “collating-element <z-s> from
"<U007A><U0073>"” line, it is not the name z-s that names the
collating symbol in regexps but the string "zs" given after "from".
This makes sense, since otherwise how could anyone write portable
regular expressions?

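Until the engine accepts the POSIX spelling, a script can probe which
spelling works instead of hard-coding one.  This is only a rough
sketch: the helper names collating_symbol_ok and first_accepted_symbol
are made up for illustration, and it uses LC_ALL rather than LANG so
that the requested locale wins over anything else already set in the
environment.

```shell
# Hypothetical helper: succeed if sed accepts [[.SYMBOL.]] under LOCALE.
# Usage: collating_symbol_ok LOCALE SYMBOL
collating_symbol_ok() {
    # Empty input: only the regexp compilation can fail here.
    printf '' | LC_ALL="$1" sed "s/[[.$2.]]//" >/dev/null 2>&1
}

# Hypothetical helper: print the first candidate spelling sed accepts.
# Usage: first_accepted_symbol LOCALE CANDIDATE...
first_accepted_symbol() {
    locale_name=$1
    shift
    for candidate in "$@"; do
        if collating_symbol_ok "$locale_name" "$candidate"; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}

# Prefer the POSIX spelling, fall back to the hyphenated GNU name.
first_accepted_symbol cy_GB.UTF-8 ch c-h || echo 'neither spelling accepted'
```

The result depends on the installed locale data.  If cy_GB.UTF-8 is
not available, I would expect sed to stay in the C locale and reject
both spellings.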
Writing [:alpha:] in Hungarian
------------------------------

Andras wrote:

> "zs" in particular is causing trouble for grep:
>
> % echo zs | LANG=C grep '^[^a-z]*$'
> % echo zs | LANG=hu_HU.UTF-8 grep '^[^a-z]*$'
> zs

Any program using such constructions without LC_COLLATE=C or similar
is IMHO buggy because of exactly this problem.  With some C libraries
(though not current glibc, luckily), in English, [^a-z] matches A but
not Z or vice versa [3].  (Current POSIX leaves the behavior of range
expressions outside the POSIX locale unspecified.)

The [. .] notation seems to work here:

    % echo zs | LANG=hu_HU.UTF-8 grep '^[^a-[.z-s.]]*$'
    %

Once the regexp engine is fixed, that regexp would become
'^[^a-[.zs.]]*$'.

Bracket expressions match collating elements
--------------------------------------------

Andras wrote:

> % echo ty | LANG=C grep '^[s-u]*$'
> % echo ty | LANG=hu_HU.UTF-8 grep '^[s-u]*$'
> ty

POSIX is unambiguous about this: bracket expressions match collating
elements, not characters.  I can imagine situations where this would
be helpful and situations where it would be unhelpful.  Mostly, it
just seems difficult to do any other way, since otherwise what would
the ranges mean?

The simplest workaround is to use LC_COLLATE=C (or en_US.UTF-8, or
C.UTF-8 once glibc learns that, or whatever locale has the behavior
you want).

Computers are dumb
------------------

Andras wrote:

> 1. grep has no way of knowing whether a "zs" sequence is a "single letter"
> or two letters, because the combination can occur in compound words without
> becoming a "zs" letter; for example, in "fúvószenekar" ("fúvós" +
> "zenekar"), it's simply an "s" and a "z" letter next to each other. There
> may even exist words that make (a different) sense either way, but I can't
> think of any right now.

Are there simple heuristics that would make this condition easy to
discover?  For example, vowels that would never appear before a true
"sz" letter, things like that?  I am just curious; please feel free
to e-mail me privately about this.

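To make the ambiguity concrete, here is a small sketch of what a
purely textual matcher sees.  The helper name mark_sz is made up for
illustration, and it deliberately runs sed in the C locale so that no
collation data is involved at all.

```shell
# Hypothetical helper: mark every "sz" byte sequence in a word.
# LC_ALL=C makes sed work byte by byte, ignoring collating elements.
mark_sz() {
    printf '%s\n' "$1" | LC_ALL=C sed 's/sz/<sz>/g'
}

# The "s" and "z" here meet at the boundary of "fúvós" + "zenekar".
mark_sz 'fúvószenekar'
```

This prints fúvó<sz>enekar: the sequence is flagged even though it
spans the boundary between the two parts of the compound, and nothing
in the text itself tells the matcher otherwise.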
The ambiguity Andras describes sounds like a (hard to fix) bug in the
collation algorithm, but not a reason not to make 'sort' follow the
conventions of the language.

An argument could be made that although 'sort' should use the
customary collation order, regexp matching should not.  The strongest
counterargument I know of is that it is hard to find a different rule
that would be useful for regular expressions in, e.g., Hebrew.

. matches a character
---------------------

Andras wrote:

> % echo zs | LANG=hu_HU.UTF-8 grep "^[a-z]*$"
> zs
> % echo azsa | LANG=hu_HU.UTF-8 grep "^a.a$"
> % echo azsa | LANG=hu_HU.UTF-8 grep "^a[^a-z]a$"
> azsa

POSIX is unambiguous about this, too: . matches a single character,
not a collating element.  I assume this is mostly for speed.  If you
want to match an arbitrary collating element, it is not obvious to me
how to do so.  [[:print:]] would capture the most important ones.

A related collating element bug
-------------------------------

There is some other ugliness: any single-byte character like [.e.]
works fine, but multi-byte characters like [.é.] do not.

    $ echo 'é and more' | LANG=en_US.UTF-8 sed 's/[[.é.]]/<MATCHED>/'
    sed: -e expression #1, char 21: Invalid collation character

I think a fix to the [[.zs.]] bug would automatically fix this as
well.

Hope that helps,
Jonathan

[1] http://www.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap07.html#tag_07_03_02_01
[2] item 4 from http://www.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_03_05
[3] http://mail-index.netbsd.org/tech-userlevel/2008/08/08/msg000986.html