On Tue, Apr 28, 2009 at 01:28:40PM -0400, Mark J. Reed wrote:
> On Tue, Apr 28, 2009 at 10:22 AM, Larry Wall <la...@wall.org> wrote:
> > Does anyone know offhand whether the Unicode Consortium has an explicit
> > policy against use of punctuation in a charname?  So far they only
> > seem to use hyphen and parens, but I wonder to what extent we can
> > depend on that...
> 
> According to the 5.0.0 standard, section 4.8:
> 
> "Unicode character names contain only uppercase Latin letters A
> through Z, digits, space, and hyphen-minus."
> 
> So it seems the notes in parentheses are not considered part of the char name.

Countering this, though:

* The XML schema for the "Unicode Character Database in XML" [1] 
  seems to allow parens in the character name property:

    character-name = xsd:string { pattern="([A-Z0-9 #\-\(\)]*)|(<control>)" } 

* The Unicode character name database [2] has parens in the
  name property field for many characters:

    000A;<control>;Cc;0;B;;;;;N;LINE FEED (LF);;;;

* ICU doesn't seem to recognize the versions of the name without
  the parens (or if it does, I haven't been able to figure out the
  correct incantations to make it do so).
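
A rough way to check the first two points against the rule Mark quoted
is sketched below in Python. It is an illustration, not anything taken
from the Unicode tools or ICU. The helper names (allowed_by_standard,
allowed_by_schema, fields_with_parens) are made up for this example, the
first regex is only my reading of the sentence from section 4.8, and
Python's re module is used as an approximation of XSD pattern matching
(XSD patterns match the whole string, hence fullmatch).

```python
import re

# Repertoire described in section 4.8 as quoted above: uppercase A-Z,
# digits, space, and hyphen-minus. (Assumption: this regex is an
# interpretation of that sentence, not text from the standard.)
STANDARD_NAME = re.compile(r"[A-Z0-9 \-]+")

# Pattern copied from the tr42 schema line quoted above.
SCHEMA_NAME = re.compile(r"([A-Z0-9 #\-\(\)]*)|(<control>)")

# The UnicodeData.txt record quoted above.
RECORD = "000A;<control>;Cc;0;B;;;;;N;LINE FEED (LF);;;;"


def allowed_by_standard(name):
    """True if the whole name uses only the section 4.8 repertoire."""
    return STANDARD_NAME.fullmatch(name) is not None


def allowed_by_schema(name):
    """True if the whole name matches the schema pattern."""
    return SCHEMA_NAME.fullmatch(name) is not None


def fields_with_parens(record):
    """Return (index, value) for each ;-separated field containing a paren."""
    return [(i, f) for i, f in enumerate(record.split(";"))
            if "(" in f or ")" in f]


if __name__ == "__main__":
    for candidate in ("LINE FEED", "LINE FEED (LF)", "<control>"):
        print(candidate,
              allowed_by_standard(candidate),
              allowed_by_schema(candidate))
    print(fields_with_parens(RECORD))
```

In the sample record, the parenthesized text turns up in the field at
index 10, while the field at index 1 holds <control>. The file-format
documentation appears to describe that later field as the Unicode 1.0
name rather than the current name, so which field counts as "the name"
is worth double-checking before drawing conclusions from it.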

Of course, it's very possible that I'm misreading the Unicode
specifications, and the note that Mark cites would seem to be
very explicit.  But so far, in playing with this, I've seen
more indications that the parens are allowed (or even required)
than that they're excluded.

Pm

[1] http://www.unicode.org/reports/tr42/tr42-3.html#N66310
[2] http://unicode.org/Public/UNIDATA/UnicodeData.txt
