On Tue, Apr 28, 2009 at 01:28:40PM -0400, Mark J. Reed wrote:
> On Tue, Apr 28, 2009 at 10:22 AM, Larry Wall <la...@wall.org> wrote:
> > Does anyone know offhand whether the Unicode Consortium has an explicit
> > policy against use of punctuation in a charname?  So far they only
> > seem to use hyphen and parens, but I wonder to what extent we can
> > depend on that...
>
> According to the 5.0.0 standard, section 4.8:
>
> "Unicode character names contain only uppercase Latin letters A
> through Z, digits, space, and hyphen-minus."
>
> So it seems the notes in parentheses are not considered part of the
> char name.

Countering this, though:

* The XML schema for the "Unicode Character Database in XML" [1]
  seems to allow parens in the character name property:

    character-name = xsd:string { pattern="([A-Z0-9 #\-\(\)]*)|(<control>)" }

* The Unicode character name database [2] has parens in the name
  property field for many characters:

    000A;<control>;Cc;0;B;;;;;N;LINE FEED (LF);;;;

* ICU doesn't seem to recognize the versions of the name without
  the parens (or if it does, I haven't been able to figure out the
  correct incantations to make it do so).

Of course, it's very possible that I'm misreading the Unicode
specifications, and the note that Mark cites would seem to be
very explicit.  But thus far in playing with this I've seen more
indications that the parens are allowed or even required than
I've seen that indicate they're excluded.

Pm

[1] http://www.unicode.org/reports/tr42/tr42-3.html#N66310
[2] http://unicode.org/Public/UNIDATA/UnicodeData.txt
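
One way to compare the two rules quoted above is to run both against the sample UnicodeData.txt line. The sketch below is only illustrative: SCHEMA_NAME is the pattern copied from the schema excerpt, STRICT_NAME is my own transcription of the section 4.8 wording into a regex (an assumption, not text from the standard), and classify() is a hypothetical helper. It also assumes, as I understand the UnicodeData.txt layout, that the field at index 1 is the character name and the field at index 10 is the older Unicode 1.0 name, which is where "LINE FEED (LF)" sits in the sample line.

```python
import re

# Pattern copied from the UCD-in-XML schema excerpt quoted in the message.
# XSD patterns are implicitly anchored, so fullmatch() is used below.
SCHEMA_NAME = re.compile(r"([A-Z0-9 #\-\(\)]*)|(<control>)")

# Assumption: a regex transcription of the section 4.8 sentence
# (uppercase A-Z, digits, space, hyphen-minus). Not taken from the standard.
STRICT_NAME = re.compile(r"[A-Z0-9 \-]*")

# Sample line quoted in the message.
SAMPLE = "000A;<control>;Cc;0;B;;;;;N;LINE FEED (LF);;;;"


def classify(text):
    """Hypothetical helper: report which of the two rules accept `text`."""
    return {
        "schema": SCHEMA_NAME.fullmatch(text) is not None,
        "strict": STRICT_NAME.fullmatch(text) is not None,
    }


fields = SAMPLE.split(";")

# Assumed layout: index 1 is the name, index 10 is the Unicode 1.0 name.
for index in (1, 10):
    print(index, repr(fields[index]), classify(fields[index]))

# The same name with the parenthesized note removed, for comparison.
print(repr("LINE FEED"), classify("LINE FEED"))
```

For the sample line, both "<control>" and "LINE FEED (LF)" are accepted by the schema pattern and rejected by the strict rule, while plain "LINE FEED" passes both. That is consistent with both observations in the thread, and it suggests checking which field a tool such as ICU actually reads before concluding that parens are part of the name proper.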