On Tue, Apr 28, 2009 at 2:27 PM, Patrick R. Michaud <pmich...@pobox.com> wrote:
>> According to the 5.0.0 standard, section 4.8:
>>
>> "Unicode character names contain only uppercase Latin letters A
>> through Z, digits, space, and hyphen-minus."
>>
>> So it seems the notes in parentheses are not considered part of the char
>> name.
>
> Countering this, though:
>
> * The XML schema for the "Unicode Character Database in XML" [1]
>   seems to allow parens in the character name property:
>
>   character-name = xsd:string { pattern="([A-Z0-9 #\-\(\)]*)|(<control>)" }

Also '#', though I see no character names containing that symbol. But
all the parentheses I see in the list of character names surround
lowercase letters, which are explicitly disallowed not only by the
spec I quoted, but also by the XML schema definition you quote above.
For example:

00C6 LATIN CAPITAL LETTER AE (ash)

> * The Unicode character name database [2] has parens in the
>   name property field for many characters
>
>   000A;<control>;Cc;0;B;;;;;N;LINE FEED (LF);;;;

That's not the name property field. The Unicode character name is
field 1 ("<control>", in this case). The field whose value is
"LINE FEED (LF)" is the Unicode_1_Name field, which for control
characters supplies the ISO 6429 name.

-- 
Mark J. Reed <markjr...@gmail.com>
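
P.S. To make the field numbering concrete, here is a minimal Python
sketch. It is not taken from any Unicode tooling; the variable names
and both regexes are my own illustrative assumptions. It splits the
record quoted above on semicolons, then checks a few strings against a
Python transliteration of the quoted schema pattern and against a
stricter pattern based on the section 4.8 wording.

```python
import re

# The UnicodeData.txt record quoted above, copied verbatim.
record = "000A;<control>;Cc;0;B;;;;;N;LINE FEED (LF);;;;"

# Fields are semicolon-separated and counted from 0.
fields = record.split(";")
name = fields[1]             # the Name field
unicode_1_name = fields[10]  # the Unicode_1_Name field

# Assumption: a Python rendering of the quoted schema pattern.
# XSD patterns match the whole string, hence re.fullmatch below.
schema_pattern = re.compile(r"([A-Z0-9 #\-\(\)]*)|(<control>)")

# Assumption: a strict reading of section 4.8 (uppercase A-Z,
# digits, space, hyphen-minus, and nothing else).
strict_pattern = re.compile(r"[A-Z0-9 \-]+")

print("Name:          ", name)
print("Unicode_1_Name:", unicode_1_name)

for candidate in (name, unicode_1_name,
                  "LATIN CAPITAL LETTER AE",
                  "LATIN CAPITAL LETTER AE (ash)"):
    print(repr(candidate),
          "schema:", bool(schema_pattern.fullmatch(candidate)),
          "strict:", bool(strict_pattern.fullmatch(candidate)))
```

The parenthesized lowercase note fails both patterns, while the
uppercase "(LF)" passes only the looser schema pattern.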