Sherman wrote:

> Introducing in the new perl style \x{...} as the hexadecimal notation
> appears to be a nice-to-have enhancement (I will file a RFE to put this
> request in record). But I don't think you can simply deny that the Java
> Unicode escape sequences for UTF16 is NOT A "mechanism"/notation for
> specifying any Unicode code point in Java RegEx, in which two
> consecutive Unicode escapes that represent a legal utf16 surrogate pair
> are interpreted as the corresponding supplementary code point.

I realize we've already gone over this, and I think we both agree it isn't all that big of a deal, given that it is not altogether impossible under the current system and given also that you will file an RFE about it. (Plus it's not much code.)

But I've uncovered something in tr18 I hadn't noticed before. In their examples they specifically include a code point from above the BMP, U+10450 SHAVIAN LETTER PEEP. I now believe it significant that they did *not* show this code point using a pair of UTF-16 code units as in \uD801\uDC50, but instead invented a brand new syntax: \U00010450.

If you look back through the revisions to tr18, you'll see that this was specifically added not all that long after Unicode went from 16 bits to 21 bits. It first appeared in revision 7 of tr18, released 2003-05-15:

    http://www.unicode.org/reports/tr18/tr18-7.html#Hex_notation

To me this evidence strongly suggests that they really *do* intend that folks with non-BMP code points *not* have to write a pair of surrogates' hex values to specify a single logical character in regexes. If they thought two \uXXXX \uXXXX sufficed, they would not have needed to make the update that they intentionally put in there for \UXXXXXXXX. Because they did so, I believe surrogate notation is not enough to meet this requirement.

It's just as well that Java can't do \UXXXXXXXX the way Python requires. Java can't because its regexes have already adopted the Perl "translation" escapes, including \Q and \U, which means \U is already taken. I say it's just as well because I don't like how you'd have to write out all 8 hex digits every time (to avoid ambiguity), when in fact you will never need them all for any 21-bit code point. Because \x{XXX} has braces around it, it's safe from meaning something else even if there are more hex digits immediately afterwards.

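To make the "not altogether impossible" case concrete, here is a minimal sketch of what writing U+10450 looks like today. It assumes a JDK whose regex parser folds two consecutive \uXXXX escapes into one supplementary code point, as Sherman describes. The class name and the str() and literal() helpers are my own illustrative names, not anything in java.util.regex.

```java
import java.util.regex.Pattern;

final class SurrogatePairRegexDemo {
    // U+10450 SHAVIAN LETTER PEEP, the non-BMP example used in tr18.
    static final int PEEP = 0x10450;

    // The code point spelled as two UTF-16 code units. The doubled
    // backslashes hand the escapes to the regex parser rather than to
    // the Java compiler.
    static final String SURROGATE_CLASS = "[a\\uD801\\uDC50]";

    // Hypothetical helper: a String holding exactly one code point.
    static String str(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    // Hypothetical helper: a pattern for one literal code point, built
    // without computing any surrogate hex values by hand.
    static Pattern literal(int codePoint) {
        return Pattern.compile(Pattern.quote(str(codePoint)));
    }

    public static void main(String[] args) {
        String peep = str(PEEP);
        // Code units versus code points for the same one-character string.
        System.out.println(peep.length());
        System.out.println(peep.codePointCount(0, peep.length()));
        // Both spellings are tried against the single supplementary character.
        System.out.println(Pattern.matches(SURROGATE_CLASS, peep));
        System.out.println(literal(PEEP).matcher(peep).matches());
    }
}
```

Either route works, but the first makes the author do the surrogate arithmetic, and the second only works when the pattern is assembled in code rather than read from a file.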
--tom

RL1.1 Hex Notation

    To meet this requirement, an implementation shall supply a mechanism
    for specifying any Unicode code point (from U+0000 to U+10FFFF).

    A sample notation for listing hex Unicode characters within strings is
    by prefixing four hex digits with "\u" and prefixing eight hex digits
    with "\U". This would provide for the following addition:

        <codepoint> := <character>
        <codepoint> := ESCAPE U_SHORT_MARK HEX_CHAR HEX_CHAR HEX_CHAR HEX_CHAR
        <codepoint> := ESCAPE U_LONG_MARK HEX_CHAR HEX_CHAR HEX_CHAR HEX_CHAR
                       HEX_CHAR HEX_CHAR HEX_CHAR HEX_CHAR
        U_SHORT_MARK := "u"
        U_LONG_MARK := "U"

    Examples:

        [\u3040-\u309F \u30FC]  Match Hiragana characters, plus prolonged
                                sound sign
        [\u00B2 \u2082]         Match superscript and subscript 2
        [a \U00010450]          Match "a" or U+10450 SHAVIAN LETTER PEEP

    Note: instead of [...\u3040...], an alternate syntax is
    [...\x{3040}...], as in Perl 5.6 and later.

    Note: more advanced regular expression engines can also offer the
    ability to use the Unicode character name for readability. See 2.5
    Name Properties.
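
For illustration only, the three <codepoint> productions quoted above can be read as a small parser. This is a rough sketch of the sample grammar, not how java.util.regex or any other engine implements it. The class and method names are made up for the example, and ESCAPE is assumed to be a backslash.

```java
final class Tr18HexEscape {
    // Hypothetical helper: reads one <codepoint> starting at offset and
    // returns its code point value.
    static int parseCodePoint(String s, int offset) {
        if (s.charAt(offset) != '\\') {
            // <codepoint> := <character>
            return s.codePointAt(offset);
        }
        if (offset + 1 >= s.length()) {
            throw new IllegalArgumentException("dangling escape");
        }
        char mark = s.charAt(offset + 1);
        int digits;
        if (mark == 'u') {
            digits = 4;   // U_SHORT_MARK: exactly four hex digits
        } else if (mark == 'U') {
            digits = 8;   // U_LONG_MARK: exactly eight hex digits
        } else {
            throw new IllegalArgumentException("unsupported escape: \\" + mark);
        }
        int start = offset + 2;
        int end = start + digits;
        if (end > s.length()) {
            throw new IllegalArgumentException("too few hex digits");
        }
        long value = 0;   // long, so eight hex digits cannot overflow
        for (int i = start; i < end; i++) {
            int d = Character.digit(s.charAt(i), 16);
            if (d < 0) {
                throw new IllegalArgumentException("not a hex digit at index " + i);
            }
            value = value * 16 + d;
        }
        if (value > Character.MAX_CODE_POINT) {
            throw new IllegalArgumentException("beyond U+10FFFF");
        }
        return (int) value;
    }

    public static void main(String[] args) {
        // The tr18 examples: a BMP escape and the non-BMP \U form.
        System.out.println(Integer.toHexString(parseCodePoint("\\u3040", 0)));
        System.out.println(Integer.toHexString(parseCodePoint("\\U00010450", 0)));
    }
}
```

The fixed digit counts are what make \U unambiguous without a closing delimiter, and also why all eight digits must always be written out.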