Tom,
I would not read too much into this :-) There is no reason for tr18 to
use any specific encoding in the specification; it is a perfectly
reasonable choice to simply pick the syntax notation that uses the code
point value directly. However, I don't think this "sample" syntax (which
might even be interpreted as a "recommendation") prevents a real-world
implementation from using whatever reasonable notation achieves the same
goal. It was the decision of JSR204, back in JDK 1.5, that the Java
language uses a pair of UTF-16 surrogates as the notation for a
supplementary character. The supplementary character support in
j.u.regex is part of the JSR204 specification. I would assume that the
JSR204 expert group back then believed that the Java Unicode escapes
(\unnnn) and the surrogate pair are good enough as the notation for all
Unicode code points, with which I totally agree. That said, I still
believe that \x{...} is a nice-to-have regex construct for people who
want a more "direct" representation in their regex.
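As a minimal sketch of the two notations side by side: the class and method names below are illustrative only, and the \x{...} line assumes a j.u.regex that supports that construct (Java 7 or later). All three checks are expected to print true.

```java
import java.util.regex.Pattern;

// Illustrative only: class and method names are not from this thread.
final class NotationDemo {
    // U+10450 SHAVIAN LETTER PEEP as a Java String (two UTF-16 code units).
    static final String PEEP = new String(Character.toChars(0x10450));

    // True if the whole input matches the given regex.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        // Surrogate-pair notation, parsed by the regex engine itself.
        System.out.println(matchesWhole("\\uD801\\uDC50", PEEP));
        // Direct code point notation (assumes Java 7 or later).
        System.out.println(matchesWhole("\\x{10450}", PEEP));
        // Either way the engine sees one code point, so "." consumes it whole.
        System.out.println(matchesWhole(".", PEEP));
    }
}
```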
-Sherman
On 1-24-2011 07:14 PM, Tom Christiansen wrote:
Sherman wrote:
Introducing the new Perl-style \x{...} as the hexadecimal notation
appears to be a nice-to-have enhancement (I will file an RFE to put this
request on record). But I don't think you can simply deny that the Java
Unicode escape sequences for UTF-16 are a "mechanism"/notation for
specifying any Unicode code point in Java RegEx, in which two
consecutive Unicode escapes that represent a legal UTF-16 surrogate pair
are interpreted as the corresponding supplementary code point.
I realize we've already gone over this, and I think we both agree it isn't
all that big of a deal, given that it is not altogether impossible under
the current system and given also that you will file an RFE about it.
(Plus it's not much code.)
But I've uncovered something in tr18 I hadn't noticed before. In their
examples they specifically include a code point from above BMP, U+10450
SHAVIAN LETTER PEEP. I now believe it significant that they did *not*
show this code point using a pair of UTF-16 code units as in \uD801\uDC50,
and that they instead invented a brand new syntax: \U00010450.
If you look back through the revisions to tr18, you'll see that this was
specifically added not all that long after Unicode went from 16 bits to
21 bits. It first appeared in revision 7 of tr18, released 2003-05-15:
http://www.unicode.org/reports/tr18/tr18-7.html#Hex_notation
To me this evidence strongly suggests that they really *do* intend that
folks with non-BMP code points *not* have to write a pair of surrogates'
hex values to specify a single logical character in regexes. If they
thought two \uXXXX \uXXXX sufficed, they would not have needed to make the
update that they intentionally put in there for \UXXXXXXXX. Because they
did so, I believe surrogate notation is not enough to meet this requirement.
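For reference, the arithmetic that turns U+10450 into the pair quoted above can be sketched as follows. This is just the standard UTF-16 encoding rule; the class and method names are made up for illustration.

```java
// Illustrative only: class and method names are not from this thread.
final class SurrogateMath {
    // Splits a supplementary code point into its UTF-16 surrogate pair,
    // formatted as two four-digit Unicode escapes.
    static String toEscapes(int codePoint) {
        if (!Character.isSupplementaryCodePoint(codePoint)) {
            throw new IllegalArgumentException("not above the BMP");
        }
        int offset = codePoint - 0x10000;      // 20 significant bits remain
        int high = 0xD800 + (offset >> 10);    // top 10 bits
        int low = 0xDC00 + (offset & 0x3FF);   // bottom 10 bits
        return String.format("\\u%04X\\u%04X", high, low);
    }

    public static void main(String[] args) {
        // For U+10450: offset 0x0450, high D801, low DC50.
        System.out.println(toEscapes(0x10450));
    }
}
```

This is the step a regex author has to do by hand for every non-BMP character when only the four-digit escape is available.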
It's just as well that Java can't do \UXXXXXXXX the way Python requires.
Java can't because its regexes have already adopted the Perl "translation"
escapes, including \Q and \U, which means \U is already taken. I say it's
just as well because I don't like how you'd have to write out all 8 hex
digits every time (to avoid ambiguity), when in fact you will never need
them all for any 21-bit code point. Because \x{XXX} has braces around it,
it's safe from meaning something else even if there are more hex digits
immediately afterwards.
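A small sketch of that delimiting point, assuming a Java regex engine that accepts the braced form (Java 7 or later); the class and method names are illustrative only. All three checks are expected to print true.

```java
import java.util.regex.Pattern;

// Illustrative only: class and method names are not from this thread.
final class BraceDemo {
    // U+10450 SHAVIAN LETTER PEEP as a Java String.
    static final String PEEP = new String(Character.toChars(0x10450));

    // True if the whole input matches the given regex.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        // The closing brace ends the escape, so the trailing "0" is a literal digit.
        System.out.println(matchesWhole("\\x{10450}0", PEEP + "0"));
        // The brace-less form takes exactly two hex digits in Java:
        // this is U+0041 ("A") followed by a literal "0".
        System.out.println(matchesWhole("\\x410", "A0"));
        // No zero padding is needed inside the braces.
        System.out.println(matchesWhole("\\x{41}", "A"));
    }
}
```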
--tom
RL1.1 Hex Notation
To meet this requirement, an implementation shall supply a mechanism
for specifying any Unicode code point (from U+0000 to U+10FFFF).
A sample notation for listing hex Unicode characters within strings is
by prefixing four hex digits with "\u" and prefixing eight hex digits
with "\U". This would provide for the following addition:
<codepoint> := <character>
<codepoint> := ESCAPE U_SHORT_MARK
               HEX_CHAR HEX_CHAR HEX_CHAR HEX_CHAR
<codepoint> := ESCAPE U_LONG_MARK
               HEX_CHAR HEX_CHAR HEX_CHAR HEX_CHAR
               HEX_CHAR HEX_CHAR HEX_CHAR HEX_CHAR
U_SHORT_MARK := "u"
U_LONG_MARK := "U"
Examples:
[\u3040-\u309F \u30FC]   Match Hiragana characters, plus prolonged sound sign
[\u00B2 \u2082]          Match superscript and subscript 2
[a \U00010450]           Match "a" or U+10450 SHAVIAN LETTER PEEP
Note: instead of [...\u3040...], an alternate syntax is [...\x{3040}...],
as in Perl 5.6 and later.
Note: more advanced regular expression engines can also offer the
ability to use the Unicode character name for readability.
See 2.5 Name Properties.
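A rough Java rendering of the three tr18 examples above, as a sketch only. The class and field names are illustrative, and the third pattern assumes a j.u.regex with \x{...} support (Java 7 or later). The spaces from the tr18 examples are dropped, since without the COMMENTS flag java.util.regex would treat them as literal spaces inside the class.

```java
import java.util.regex.Pattern;

// Illustrative only: class and field names are not from tr18 or this thread.
final class Tr18Examples {
    // Hiragana block plus U+30FC, the prolonged sound sign.
    static final Pattern HIRAGANA = Pattern.compile("[\\u3040-\\u309F\\u30FC]");
    // Superscript two and subscript two.
    static final Pattern TWOS = Pattern.compile("[\\u00B2\\u2082]");
    // "a" or U+10450 SHAVIAN LETTER PEEP (braced form assumes Java 7 or later).
    static final Pattern A_OR_PEEP = Pattern.compile("[a\\x{10450}]");

    // True if the whole input matches the pattern.
    static boolean matchesWhole(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        String peep = new String(Character.toChars(0x10450));
        System.out.println(matchesWhole(HIRAGANA, "\u3042"));  // HIRAGANA LETTER A
        System.out.println(matchesWhole(TWOS, "\u2082"));      // SUBSCRIPT TWO
        System.out.println(matchesWhole(A_OR_PEEP, peep));
    }
}
```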