Now I will discuss the more interesting of my two functions, the one that handles charclass escapes such as those given in RL1.2a. The particular Level 1 places where this code is relevant are RL1.2a's Annex C Compatibility Properties, RL1.4 Simple Word Boundaries, and RL1.6 Line Boundaries.

I do not try to change the overloaded property names in \p{...} that Java is using for something other than what Unicode says they must mean. I only change the single-letter backslash escapes, but for those I do implement all the Standard Recommendations. It is not too hard.

I could not satisfy the list of 11 required Unicode properties from RL1.2, of which Java currently provides only 3. I did manage to cover all backslash charclass escapes listed in RL1.2a. I believe that apart from the 8 (now 7) missing properties from RL1.2, between the previously discussed function from part 2 and the one I describe below, I address all other concerns I have raised about Java's Level 1 conformance.

Again, I present this not as code I think you should necessarily use. I merely cite it as an existence proof that these corrections are perfectly feasible and, for the most part, fairly easily done. Sometimes just knowing that something *can* be done makes all the difference.

I do not discuss matters of backwards compatibility. There are many possible approaches to that, and I will address them in a future letter.

CHARCLASS ESCAPES
=================

The function translates the following regex-only escapes into equivalent constructs that allow Java to meet the RL1.2a requirements as I understand them. The first row changes how existing Java patterns (mis)behave, while the second row is new to Java:

    \s \S   \w \W   \b \B   \d \D
    \v \V   \h \H   \R      \X

Java's biggest problem is \w vs \b; Java's second biggest problem in this area is \s. (Capitalize those as needed to produce the complements, about which the same applies.)

I'll take these charclass escapes one set at a time.

For \d, I use the Standard Recommendation of \p{Decimal_Number}. This of all the charclass escapes seems to have a sufficiently useful alternate definition that one might wish to make some allowance.
Indeed, Annex C already does so: under the POSIX Compatible column, it shows that one is allowed to use [0-9] instead of this.

For \s, I use the strict Unicode White_Space definition as defined by RL1.2. Although required for Level 1 compliance, this is not a property that Java currently has direct access to. Indeed, Java currently has no mechanism for detecting Unicode whitespace. The following table of J(ava) vs P(erl) results illustrates the problem:

                                 Code point
    Regex                 000A    0085    00A0    2029
                          J  P    J  P    J  P    J  P
    \s                    1  1    0  1    0  1    0  1
    \pZ                   0  0    0  0    1  1    1  1
    \p{Zs}                0  0    0  0    1  1    0  0
    \p{Space}             1  1    0  1    0  1    0  1
    \p{Blank}             0  0    0  0    0  1    0  0
    \p{Whitespace}        -  1    -  1    -  1    -  1
    \p{javaWhitespace}    1  -    0  -    0  -    1  -
    \p{javaSpaceChar}     0  -    0  -    1  -    1  -

I therefore enumerate the small set directly:

    \s  [\u0009-\u000D\u0020\u0085\u00A0\u1680\u180E\u2000-\u200A\u2028\u2029\u202F\u205F\u3000]

The \v and \h are the Unicode Vertical_Space and Horizontal_Space properties, for programmer convenience and also compatibility with Perl. Those are

    \v  [\u000A-\u000D\u0085\u2028\u2029]
    \h  [\u0009\u0020\u00A0\u1680\u180E\u2000-\u200A\u202F\u205F\u3000]

The complements, \S, \V, and \H, follow directly from the definitions given above.

For \w, I have taken the nearest possible approximation to Annex C's definition, which is all of

    \p{alpha}
    \p{gc=Mark}
    \p{digit}
    \p{gc=Connector_Punctuation}

I say nearest possible approximation, but what I have devised is input-output equivalent to that definition. That is, it produces zero false negatives and zero false positives when checked against every single possible valid Unicode code point. This should suffice.

Now, normally, I would not be able to do this because I have no access to the full Unicode Alphabetic property, which includes various but not all Mark characters, and some circled letters. Fortunately, the addition of all Mark characters to \w removes the problem of distinguishing the Alphabetic Marks from the non-Alphabetic ones.
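
Returning briefly to the three whitespace sets enumerated above: one quick sanity check is that \s should be exactly the union of \v and \h. The following is a minimal sketch of that check, not the translation function this letter describes. The class name WhitespaceClasses and the constants S, V, and H are made up for illustration. The \s set is written here starting at U+0009, since TAB is White_Space and also appears in \h.

```java
import java.util.regex.Pattern;

public class WhitespaceClasses {
    // The three enumerated sets as Java character classes.
    // Names S, V, H are hypothetical; S starts at U+0009 (TAB).
    static final String S = "[\\u0009-\\u000D\\u0020\\u0085\\u00A0\\u1680\\u180E"
                          + "\\u2000-\\u200A\\u2028\\u2029\\u202F\\u205F\\u3000]";
    static final String V = "[\\u000A-\\u000D\\u0085\\u2028\\u2029]";
    static final String H = "[\\u0009\\u0020\\u00A0\\u1680\\u180E"
                          + "\\u2000-\\u200A\\u202F\\u205F\\u3000]";

    // Counts the code points on which "matches S" disagrees with
    // "matches V or H". Surrogate code points are skipped.
    static int mismatches() {
        Pattern s = Pattern.compile(S);
        Pattern vOrH = Pattern.compile(V + "|" + H);
        int bad = 0;
        for (int cp = 0; cp <= Character.MAX_CODE_POINT; cp++) {
            if (cp >= Character.MIN_SURROGATE && cp <= Character.MAX_SURROGATE) {
                continue;
            }
            String ch = new String(Character.toChars(cp));
            if (s.matcher(ch).matches() != vOrH.matcher(ch).matches()) {
                bad++;
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        System.out.println("mismatches: " + mismatches());
    }
}
```

When the three sets agree, mismatches() returns 0. The same loop shape works for comparing any two single-character classes over the whole code point range.
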
I have even devised a way to handle the circled letters. My resulting translation of \w is

    \w  [\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]

and so my translation of \W is the complement of that bracketed character class:

    \W  [^\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]

It is far more reasonable to require ASCII-only people to write (\p{ASCII}\w) if that is what they want than it is to make Unicode people write the long version I give above when they want it! But I have several different ideas for how to appease the Backwards Compatibility Police. More on that later.

The \R is to meet one of the two strong recommendations for Level 1 compliance, found under RL1.6 Line Boundaries.

The \X is to meet RL1.2a, and also to go part of the way toward meeting RL2.2 Extended Grapheme Clusters. It would be easy to implement Legacy Grapheme Clusters exactly. For that you would merely need to replace \X with (?:\PM\pM*+), which is easily enough done. However, I wanted Extended Grapheme Clusters, which is *much* harder. In fact, it is this hard:

    EGC = ( CR LF )
        | ( Prepend*
            ( L+ | (L* ( ( V | LV ) V* | LVT ) T*) | T+ | [^ Control CR LF ] )
            ( Extend | SpacingMark )* )
        | .

where the L, V, LV, LVT, and T bits are defined in terms of the Hangul Syllable Type property. I came very close, but absent full access to the HST property values, it is not always going to work correctly on Korean. It will work correctly on Western languages, however. I will not show the expansion here, as it has that Medusa-look property. :) But it does work except for border cases in East Asian languages.

The way \b works is related to RL1.4 Simple Word Boundaries. How this ties into RL1.2a is a bit confusing, since the language on this in RL1.2a's Annex C is less than perfectly clear. However, the ambiguities disappear when we consider supporting text from elsewhere in tr18.

Under the second, POSIX Compatible column of Annex C, it gives "n/a" for \b.
Under the first, Standard Recommendation column, for \b it gives "Default Word Boundaries". Then in the third, Comments column it says:

    If there is a requirement that \b align with \w, then it would use the approximation above instead. See [UAX29], also WordBreakTest. Note that different functions are used for programming language identifier boundaries. See also [UAX31].

So we seem to be left wondering what the relationship is supposed to be between \b and \w. However, further down in 2.2 Extended Grapheme Clusters we find this text:

    \b{w}  Zero-width match at a Unicode word boundary. Note that this is different than \b alone, which corresponds to \w and \W. See Annex C: Compatibility Properties.

That clearly asserts that "\b alone corresponds to \w and \W." That is its historical sense, and that is what I have implemented in the function that replaces charclass escapes with something Java can handle.

Assuming that one uses my replacement definition of \w as given above, \b and \B work correctly if you write them this way:

    Orig    Rewrite
    \b      (?:(?<=\w)(?!\w)|(?<!\w)(?=\w))
    \B      (?:(?<=\w)(?=\w)|(?<!\w)(?!\w))

I have a gigantic test suite in which I exhaustively tested all possible combinations -- literally millions of them -- to verify that these were equivalent statements. Since I have proven that they are, I then replaced the \w in them with the longer bracket character class given above.
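
The substitution just described is mechanical, and can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not the function this letter describes: the class BoundaryRewrite, the placeholder letter W, and the helpers expand, firstWord, and complementary are all hypothetical names. Note that the nested [...&&...] class needs its own closing bracket, so the word class ends in two brackets.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BoundaryRewrite {
    // Bracketed replacement for the word class; the nested
    // intersection is closed by its own bracket.
    static final String W =
        "[\\pL\\pM\\p{Nd}\\p{Nl}\\p{Pc}[\\p{InEnclosedAlphanumerics}&&\\p{So}]]";

    // Short forms of the two rewrites; the letter W is a placeholder
    // (an assumption of this sketch) for the word class.
    static final String B     = "(?:(?<=W)(?!W)|(?<!W)(?=W))";
    static final String NOT_B = "(?:(?<=W)(?=W)|(?<!W)(?!W))";

    // Substitutes the long bracketed class for each placeholder.
    static String expand(String template) {
        return template.replace("W", W);
    }

    // First run of word characters delimited by rewritten boundaries,
    // or null if there is none.
    static String firstWord(String text) {
        Pattern p = Pattern.compile(expand(B) + W + "+" + expand(B));
        Matcher m = p.matcher(text);
        return m.find() ? m.group() : null;
    }

    // True if exactly one of the two rewritten assertions holds at
    // every index of text. Assumes text has no supplementary characters.
    static boolean complementary(String text) {
        Matcher b = Pattern.compile(expand(B)).matcher(text);
        Matcher nb = Pattern.compile(expand(NOT_B)).matcher(text);
        // Transparent bounds let the lookbehinds see before the region.
        b.useTransparentBounds(true);
        nb.useTransparentBounds(true);
        for (int i = 0; i <= text.length(); i++) {
            b.region(i, text.length());
            nb.region(i, text.length());
            if (b.lookingAt() == nb.lookingAt()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The escapes spell a French word with two accented letters.
        String word = "\u00e9l\u00e8ve";
        System.out.println(word.equals(firstWord(word)));
        System.out.println(complementary("a b_c  9," + word + "!"));
    }
}
```

Here complementary is a toy stand-in for the exhaustive testing mentioned above: it only checks, position by position in one string, that exactly one of the two rewritten assertions holds.
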

    \b  (?:(?<=[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]])(?![\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]])|(?<![\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]])(?=[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]))

    \B  (?:(?<=[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]])(?=[\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]])|(?<![\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]])(?![\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]]))

It may have been merely unreasonable to make people write the replacements for \w, but it is simply beyond the pale to ask that they on their own should write the replacements for \b.

And now at last they can in Java expect \b\w+\b to match the string "élève". And in its entirety, no less.

This concludes my letter discussing the code that brought us to each other's attention in the first place.

--tom