Tom,

I would not read too much into this :-) There is no reason for tr#18 to use any specific encoding in the specification, so picking a syntax that uses the code point value directly is a perfectly sensible choice. However, I don't think this "sample" syntax (even if it is interpreted as a "recommendation") prevents a real-world implementation from using any reasonable notation to achieve the same goal. It was the decision of JSR204, back in JDK 1.5, that the Java language uses a pair of UTF-16 surrogates as the notation for a supplementary character. The supplementary character support in j.u.regex is part of the JSR204 specification. I assume the JSR204 expert group believed back then that the Java Unicode escapes (\unnnn) and the surrogate pair are good enough as the notation for all Unicode code points, and I totally agree. That said, I still believe that \x{...} is a nice-to-have regex construct for people who want a more "direct" representation in their regex.
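
As a rough illustration of the surrogate-pair notation (a sketch only; the class and method names are made up for this example), the following writes U+10450 as two Java Unicode escapes inside a character class and checks that j.u.regex treats the pair as one code point rather than as two separate characters:

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class SurrogatePairNotation {
    // The compiler turns the two escapes into the UTF-16 code units of
    // U+10450, and the regex engine reads that pair as a single code point.
    static final Pattern A_OR_PEEP = Pattern.compile("[a\uD801\uDC50]");

    static boolean matchesWhole(String input) {
        return A_OR_PEEP.matcher(input).matches();
    }

    public static void main(String[] args) {
        String peep = new String(Character.toChars(0x10450));
        System.out.println(matchesWhole("a"));       // a single BMP character
        System.out.println(matchesWhole(peep));      // one supplementary code point
        System.out.println(matchesWhole("\uD801"));  // a lone high surrogate
    }
}
```

The lone high surrogate is not expected to match, since the class contains the whole code point and not its individual code units.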

-Sherman

On 1-24-2011 07:14 PM, Tom Christiansen wrote:
Sherman wrote:

Introducing in the new perl style \x{...} as the hexadecimal notation
appears to be a nice-to-have enhancement (I will file a RFE to put this
request in record). But I don't think you can simply deny that the Java
Unicode escape sequences for UTF16 is NOT A "mechanism"/notation for
specifying any Unicode code point in Java RegEx, in which two
consecutive Unicode escapes that represent a legal utf16 surrogate pair
are interpreted as the corresponding supplementary code point.

I realize we've already gone over this, and I think we both agree it isn't
all that big of a deal, given that it is not altogether impossible under
the current system and given also that you will file an RFE about it.
(Plus it's not much code.)

But I've uncovered something in tr18 I hadn't noticed before.  In their
examples they specifically include a code point from above BMP, U+10450
SHAVIAN LETTER PEEP.  I now believe it significant that they did *not*
show this code point using a pair of UTF-16 code units as in \uD801\uDC50,
but that they instead invented a brand new syntax: \U00010450.
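
For reference, the relationship between the two notations is plain arithmetic. The sketch below (the class and method names are hypothetical) splits U+10450 into its UTF-16 code units by hand and cross-checks the result against Character.toChars:

```java
// Hypothetical helper; the class and method names are illustrative only.
class SurrogateArithmetic {
    // Split a supplementary code point into its two UTF-16 code units.
    static char[] toSurrogates(int codePoint) {
        if (!Character.isSupplementaryCodePoint(codePoint)) {
            throw new IllegalArgumentException("not above the BMP");
        }
        int offset = codePoint - 0x10000;               // 20 significant bits
        char high = (char) (0xD800 + (offset >> 10));   // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        char[] units = toSurrogates(0x10450);
        // Prints "D801 DC50"
        System.out.printf("%04X %04X%n", (int) units[0], (int) units[1]);
        // Cross-check against the library's own conversion.
        System.out.println(java.util.Arrays.equals(units, Character.toChars(0x10450)));
    }
}
```

Nobody should have to do this by hand just to put one character in a regex, which is rather the point.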

If you look back through the revisions to tr18, you'll see that this was
specifically added not all that long after Unicode went from 16 bits to
21 bits.  It first appeared in revision 7 of tr18, released 2003-05-15:

     http://www.unicode.org/reports/tr18/tr18-7.html#Hex_notation

To me this evidence strongly suggests that they really *do* intend that
folks with non-BMP code points *not* have to write a pair of surrogates'
hex values to specify a single logical character in regexes.  If they
thought two \uXXXX \uXXXX sufficed, they would not have needed to make the
update that they intentionally put in there for \UXXXXXXXX.  Because they
did so, I believe surrogate notation is not enough to meet this requirement.

It's just as well that Java can't do \UXXXXXXXX the way Python requires.
Java can't because its regexes have already adopted the Perl "translation"
escapes, including \Q and \U, which means \U is already taken.  I say it's
just as well because I don't like how you'd have to write out all 8 hex
digits every time (to avoid ambiguity), when in fact you will never need
them all for any 21-bit code point.  Because \x{XXX} has braces around it,
it's safe from meaning something else even if there are more hex digits
immediately afterwards.
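
A minimal sketch of that last point, assuming a JDK whose Pattern accepts \x{h...h} (Java 7 and later do, to my knowledge; the class and method names here are made up). The braces end the escape, so the digits that follow are ordinary literal characters:

```java
import java.util.regex.Pattern;

// Hypothetical demo class; assumes Pattern supports the braced hex escape.
class BracedHexNotation {
    // The closing brace terminates the escape, so the trailing "50" is
    // two literal characters and not part of the code point value.
    static final Pattern PEEP_THEN_50 = Pattern.compile("\\x{10450}50");

    static boolean matchesWhole(String input) {
        return PEEP_THEN_50.matcher(input).matches();
    }

    public static void main(String[] args) {
        String peep = new String(Character.toChars(0x10450));
        System.out.println(matchesWhole(peep + "50")); // code point, then "50"
        System.out.println(matchesWhole(peep));        // missing the literal "50"
    }
}
```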

--tom

     RL1.1 Hex Notation

     To meet this requirement, an implementation shall supply a mechanism
     for specifying any Unicode code point (from U+0000 to U+10FFFF).

     A sample notation for listing hex Unicode characters within strings is
     by prefixing four hex digits with "\u" and prefixing eight hex digits
     with "\U". This would provide for the following addition:

         <codepoint>  := <character>

         <codepoint>  := ESCAPE U_SHORT_MARK
                        HEX_CHAR HEX_CHAR HEX_CHAR HEX_CHAR

         <codepoint>  := ESCAPE U_LONG_MARK
                        HEX_CHAR HEX_CHAR HEX_CHAR HEX_CHAR
                        HEX_CHAR HEX_CHAR HEX_CHAR HEX_CHAR

         U_SHORT_MARK := "u"
         U_LONG_MARK := "U"

     Examples:

         [\u3040-\u309F \u30FC]  Match Hiragana characters, plus prolonged sound sign
         [\u00B2 \u2082]         Match superscript and subscript 2
         [a \U00010450]          Match "a" or U+10450 SHAVIAN LETTER PEEP

     Note: instead of [...\u3040...], an alternate syntax
           is [...\x{3040}...], as in Perl 5.6 and later.

     Note: more advanced regular expression engines can also offer the
           ability to use the Unicode character name for readability.
           See 2.5 Name Properties.
