The Unicode Standard distinguishes between Unicode Strings (16-bit) and UTF-16. In the former, which is often the form used in programming languages, a singleton value of 0xD800..0xDFFF is allowed, and is treated as if it were a reserved code point.
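To make that concrete: java.lang.String is such a 16-bit Unicode string, so a short Java sketch can show how unpaired surrogates are counted. The class and helper names are made up for illustration; codePointCount and codePointAt are standard String methods.

```java
// Illustrative sketch: how Java's 16-bit strings count code points.
// The class name CodePointCounts is made up for this example.
final class CodePointCounts {

    // Number of code points in s; an unpaired surrogate counts as one.
    static int count(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String[] samples = {
            "\uD800",        // lone high surrogate
            "\uDC00",        // lone low surrogate
            "\uD800\uDC00",  // well-formed surrogate pair (U+10000)
            "\uDC00\uD800"   // low then high: two unpaired surrogates
        };
        for (String s : samples) {
            System.out.printf("%d code unit(s) -> %d code point(s), first is U+%04X%n",
                              s.length(), count(s), s.codePointAt(0));
        }
    }
}
```

Nothing here throws or substitutes a replacement character: the string API simply reports an unpaired surrogate as a code point of its own.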
So you do get some funny cases, because

1. 0xD800 - 1 code point (degenerate surrogate)
2. 0xDC00 - 1 code point (degenerate surrogate)
3. 0xD800 0xDC00 - 1 code point (surrogate pair)
4. 0xDC00 0xD800 - 2 code points (2 successive degenerate surrogates)

If you are working in UTF-8 or in UTF-32, then these cases wouldn't occur. They can't happen in UTF-8, and in UTF-32 both cases 3 and 4 are 2 successive degenerate surrogates.

Mark

*— Il meglio è l’inimico del bene —*

On Sat, Jan 22, 2011 at 19:54, Tom Christiansen <tchr...@perl.com> wrote:

> Sherman,
>
> In part 1, I outlined my thinking of why having to make end-users think
> about representation issues in regexes goes against, if not perhaps the
> letter of the law, then certainly to my mind the spirit of UTS(tr)#18 when
> it says that in a compliant implementation "the regular expression engine
> provides support for Unicode characters as basic logical units."
>
> Please understand that I don't think that is much of a big deal -- it's a
> rather low priority bug at worst -- because when you look at it from a
> particular perspective, it appears to be a surface-level matter only.
>
> (Also because it is easily addressed just by adding \x{XXX}, which is
> both simple and safe.)
>
> It's not that big of a deal because as you yourself point out, Sherman, you
> can still specify any code point, although you have to bend over sideways
> to do it.  But it doesn't at all affect behaviour, which is by far the more
> important matter.
>
> These next two serialization concerns, however, are different.  This time
> they are not just surface issues.  They are actual behavioral problems in
> regexes that derive from the actual internal implementation of characters
> in Java:
>
>     ** Surrogate Bugs in Regexes
>
>     ** CANON_EQ Bugs in Regexes with \\uXXXX
>
> I don't think users should have to know about those implementation
> details, but if they don't, they will get several sorts of anomalous
> behaviour.  I therefore believe those two are both genuine bugs.
>
> I know exactly what is causing the second one (code included), but
> fixing it is going to require some code rearrangement and reworking.
>
> ===========================
>  Surrogate Bugs in Regexes
> ===========================
>
> Here is one of them:
>
>     Unicode      UTF-16
>     Code Point   String           Pattern    Result
>     ==========   ==============   =========  ======
>     U+1F47E      "\uD83D\uDC7E"   /^.$/      true
>     n/a          "\uD83D"         /^.$/      TRUE!
>
> I do not understand how that same pattern--which says to match
> strings containing a single Unicode code point only--can test true on
> both those strings.  That's why I believe the TRUE! result is an error.
>
> Don't you?
>
> I understand that it brings up some tricky stuff.  Consider:
>
> If you have a string "HL" where H is a high surrogate and L a low
> surrogate, Java's regex engine correctly concludes that that string
> "HL" exactly matches the pattern "^.$" in its entirety; it has just one
> logical character in it.  This is correct.  It fails to match "^..$",
> which is also correct and for the same reason.
>
> However, if you flip those around to get string "LH", it now exactly
> matches the pattern "^..$" in its entirety, thus claiming it holds
> exactly two characters even though there are no legal
> code points there!
>
> If you have just one of the two surrogates, either "H" or "L", both of
> those will also match "^.$" just as "HL" does.  That says that a single
> surrogate is just as much a single logical character as a proper pair
> of them together is just a single logical character.
>
> But that makes no sense at all.  How can both be correct?  Surely that
> *must* be a bug?  What am I not understanding here?
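The surrogate cases quoted above are easy to reproduce. This is only a minimal sketch: the class and helper names are made up, and the only library calls are Pattern.compile and Matcher.matches from java.util.regex.

```java
import java.util.regex.Pattern;

// Illustrative sketch; the class name DotAndSurrogates is made up.
final class DotAndSurrogates {

    // True if the whole input matches the given regex.
    static boolean matches(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        String pair = "\uD83D\uDC7E";     // U+1F47E written as a surrogate pair
        String high = "\uD83D";           // unpaired high surrogate
        String flipped = "\uDC7E\uD83D";  // low surrogate first, then high

        System.out.println("pair    vs ^.$  : " + matches("^.$", pair));
        System.out.println("pair    vs ^..$ : " + matches("^..$", pair));
        System.out.println("high    vs ^.$  : " + matches("^.$", high));
        System.out.println("flipped vs ^..$ : " + matches("^..$", flipped));
    }
}
```

This is the same 16-bit string model as in cases 1 through 4: the engine walks the input by code point, and an unpaired surrogate is handed to "." as a code point of its own.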
>
> I really think that rather than returning true for something that
> isn't even a legal Unicode code point, it should instead either
>
>     1: raise an exception
>
> and/or
>
>     2: admit some pattern flag to deal with such cases
>
> I say this because you are not supposed to have to deal with representation
> and serialization issues in regexes, and this makes you think about them.
> It also gives you bizarre answers even when you do think about them.
>
> =======================================
>  CANON_EQ Bugs in Regexes with \\uXXXX
> =======================================
>
> Another place where you are forced to think about the internal
> representation in Java regexes is that they can behave differently if
> you pass things in as "\\uXXXX" instead of as "\uXXXX".  I don't think
> that can be correct behaviour, either.
>
> The problem is that CANON_EQ can no longer be trusted.  If you compile
> up these patterns with CANON_EQ, then it makes a difference whether you've
> used a literal or a \u0000 form.  Please consider these, as I believe that
> the FALSE! results below are all in error:
>
>           String       Pattern
>                        w/CANON_EQ     Result
>           =========    ============   =========
>     A :   "\u00E9"     "^\u00E9$"     true
>     B :   "\u00E9"     "^e\u0301$"    true
>     A':   "\u00E9"     "^\\u00E9$"    true
>     B':   "\u00E9"     "^e\\u0301$"   FALSE!
>
>     C :   "e\u0301"    "^\u00E9$"     true
>     D :   "e\u0301"    "^e\u0301$"    true
>     C':   "e\u0301"    "^\\u00E9$"    FALSE!
>     D':   "e\u0301"    "^e\\u0301$"   true
>
> The ABCD versions all use literals converted during the lexical
> substitution phase, whereas the prime versions use escapes for UTF-16
> code units that get passed into the regex compiler for it to consider.
>
> (This second mechanism is indispensable to meet the requirement
> of being able to code up any code point, and to facilitate reading
> patterns written in ASCII but specifying trans-ASCII code points.)
>
> You get the same problem with octal notation: you can specify U+E9 as
> "\351" for the prepass literal (which works), or as "\\0351" for the
> regex engine to see (which fails just as \\u did):
>
>           String       Pattern
>                        w/CANON_EQ     Result
>           =========    ============   =========
>     a :   "\u00E9"     "^\351$"       true
>     a':   "\u00E9"     "^\\0351$"     true
>     c :   "e\u0301"    "^\351$"       true
>     c':   "e\u0301"    "^\\0351$"     FALSE!
>
> As you might predict, using UTF-8 directly in your code and compiling with
> "javac -encoding UTF-8" behaves exactly as the non-prime "\uXXXX" versions
> do, which can be different from how the prime "\\uXXXX" versions behave.
>
> From looking at the code, I am sure I can reproduce this with \xXX escapes
> as well.  That's because you do the normalization reshuffle before you
> actually compile the pattern, so you won't see the octal or hex escapes
> when you're doing the normalization.  The bug is right here in this code,
> from around line 1500 of jdk1.7.0/java/util/regex/Pattern.java:
>
>     /**
>      * Copies regular expression to an int array and invokes the parsing
>      * of the expression which will create the object tree.
>      */
>     private void compile() {
>         // Handle canonical equivalences
>         if (has(CANON_EQ) && !has(LITERAL)) {
>             normalize();
>         } else {
>             normalizedPattern = pattern;
>         }
>         patternLength = normalizedPattern.length();
>
>         // Copy pattern to int array for convenience
>         // Use double zero to terminate pattern
>         temp = new int[patternLength + 2];
>
> Because things like \cC and \0XXX and \xXX and \uXXXX all get handled
> *after* that point in the code, they are *not* the same as literals with
> those values.  This is a genuine problem.
>
> So again we have to think about how things are stored.  It means that
> you cannot just read in patterns that have had their non-ASCII characters
> converted into \uXXXX escapes and have them work the same as having the
> literals in there.  Those are supposed to be the same as the literals, but
> they're not.
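The literal-versus-escape distinction in the tables above comes down to which layer decodes the escape. The sketch below is only an illustration: the class and helper names are made up, and it prints the CANON_EQ results rather than asserting them, because they depend on the JDK release (the tables above were taken on JDK 7 and later releases may behave differently).

```java
import java.util.regex.Pattern;

// Illustrative sketch; the class name EscapeLevels is made up.
final class EscapeLevels {

    // Compiles regex with CANON_EQ and tests it against the whole input.
    static boolean canonEqMatches(String regex, String input) {
        return Pattern.compile(regex, Pattern.CANON_EQ).matcher(input).matches();
    }

    public static void main(String[] args) {
        // The compiler substitutes the character: the regex engine gets 3 chars.
        String literalForm = "^\u00E9$";
        // The regex engine gets the six-char escape itself: 8 chars in total.
        String escapedForm = "^\\u00E9$";
        System.out.println(literalForm.length() + " vs " + escapedForm.length());

        String[] inputs = { "\u00E9", "e\u0301" };
        String[] regexes = { literalForm, "^e\u0301$", escapedForm, "^e\\u0301$" };
        for (String input : inputs) {
            for (String regex : regexes) {
                // Results for the escaped forms may differ between JDK releases.
                System.out.println(input.length() + "-unit input, "
                        + regex.length() + "-unit pattern: "
                        + canonEqMatches(regex, input));
            }
        }
    }
}
```

The length check is the point: by the time compile() runs, the literal form already contains the character itself, while the escaped form still contains a backslash sequence that any normalization pass over the pattern text would have to decode first.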
>
> This is quite apart from the--um, "syntactic infelicity"?--of the mismatch
> between how octal escapes are specified in the lexical substitution pass
> versus how they're specified in the regex engine.  That, I wouldn't quite
> call a bug so much as an unexpected wrinkle.  I do fix this in my regex
> rewriter, BTW.
>
> (There are "syntactic infelicities" with \cC, too.  It is a bit too
> undiscerning, producing things that aren't guaranteed to be control
> characters because it blindly xors whatever follows it with 64.  For
> example, \c} is = and \c= is }, \cé is © and \c© is é, etc.)
>
> This message is far too long again, so I will discuss your comments
> regarding the j.l.Character class in part 3 of 3, to be sent later on.
>
> Thanks again!
>
> --tom
>
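The octal mismatch and the \cC behaviour mentioned at the end can be checked the same way. Again a minimal sketch with made-up class and helper names, relying only on Pattern.matches; the xor-with-64 behaviour is as described in the message above and as I understand the current Pattern implementation.

```java
import java.util.regex.Pattern;

// Illustrative sketch; the class name ControlAndOctal is made up.
final class ControlAndOctal {

    // True if the whole input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Octal in a Java string literal: backslash plus up to three digits.
        System.out.println("\351".equals("\u00E9"));
        // Octal in a regex: backslash, a mandatory leading zero, then digits.
        System.out.println(matches("\\0351", "\u00E9"));

        // \cC is the control character for C, that is 0x43 xor 0x40 = 0x03.
        System.out.println(matches("\\cC", "\u0003"));
        // With a non-letter operand the result is not a control character.
        System.out.println(matches("\\c}", "="));
        System.out.println(matches("\\c=", "}"));
    }
}
```

So "\351" and "\\0351" name the same character through two different syntaxes, and \c accepts any operand without checking that the result is a control character.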