This message explains precisely how Java fails to provide any way to access these four required properties from RL1.2:
    Alphabetic
    Lowercase
    Uppercase
    Whitespace

Since Java does not provide them *by any name*, and RL1.2 specifically
includes those four in its "To meet this requirement, an implementation
shall provide at least..." list, Java does not conform to RL1.2. Since
Java does not meet RL1.2, it cannot be Level 1 conformant per tr18, and
so this claim from j.u.r.Pattern's javadoc is incorrect:

    This class is in conformance with Level 1 of Unicode Technical
    Standard #18: Unicode Regular Expression Guidelines, plus RL2.1
    Canonical Equivalents.

============================================================

Sherman wrote:

> As regarding the POSIX properties. In Java RegEx Unicode
> Alphabetic, Lowercase or Whitespace properties are supported by
> using \p{javaLetter}, \p{javaLowerCase}, \p{javaUpperCase} or
> \p{javaWhitespace}.

That has certainly not been my experience. All the things you say are
the same are things that I believe are *not* the same.

============
 Alphabetic
============

The Unicode Alphabetic property is not the same as its Letter property.
All \p{Letter} code points are \p{Alphabetic}, but not all
\p{Alphabetic} code points are \p{Letter}. According to

    http://download.oracle.com/javase/7/docs/api/java/lang/Character.html#isLetter(int)

j.l.Character.isLetter() -- and therefore \p{javaLetter} -- is:

    A character is considered to be a letter if its general category
    type, provided by getType(codePoint), is any of the following:

        UPPERCASE_LETTER
        LOWERCASE_LETTER
        TITLECASE_LETTER
        MODIFIER_LETTER
        OTHER_LETTER

That is the same as \pL, the Unicode GC=Letter property. But Unicode
Alphabetic is *not* the same as Unicode Letter; rather, the Unicode
Alphabetic property is defined by tr44 to be:

    Lu + Ll + Lt + Lm + Lo + Nl + Other_Alphabetic

The last two are where javaLetter fails. It does not detect
Letter_Number code points (Nl), nor does it consider all the many
Other_Alphabetic code points.
Other_Alphabetic is one of those internal Unicode properties from the
UCD PropList.txt file that is used exclusively to generate the
Alphabetic property. It includes various code points of general
category Mn and Mc, but there are many other Mn and Mc code points which
are *not* Other_Alphabetic.

As of Unicode 6.0, there are 811 code points in the Basic Multilingual
Plane (plane 0) which are \p{Alphabetic} but not \pL, by which I mean
that they have the Alphabetic property but lack the Letter property.
There are also 195 such code points up in the so-called "astral" planes
(planes 1-16).

Consider this code point:

    <Ⅰ> U+2160 ROMAN NUMERAL ONE

In Java, you will find that the string "\u2160" (which is a Nl or
Letter_Number code point) fails to match the pattern \pL, which is
correct, but also fails to match the property \p{javaLetter}. If
javaLetter were truly the Unicode Alphabetic property, it would succeed,
but since it fails, those are not the same. Therefore you cannot say
that the Unicode Alphabetic property is the same as the javaLetter
property; they are different things.

To demonstrate how it should work using U+2160, ROMAN NUMERAL ONE:

    $ perl -le 'print chr(0x2160) =~ /\pL/ || 0'
    0
    $ perl -le 'print chr(0x2160) =~ /\p{Alphabetic}/ || 0'
    1

=========================
 Uppercase and Lowercase
=========================

The Unicode Lowercase property is not the same as its Ll
(Lowercase_Letter) property. Again, although all \p{Ll} code points are
\p{Lowercase}, not all \p{Lowercase} code points are also \p{Ll}. As
with Alphabetic, we have others to consider:

    Lowercase = Lowercase_Letter + Other_Lowercase
    Uppercase = Uppercase_Letter + Other_Uppercase

Specifically, there are 159 \p{Lowercase} code points in the BMP which
are not also \p{Ll}. The same situation occurs with Lu
(Uppercase_Letter) versus Uppercase: there are 42 BMP code points which
are \p{Uppercase} but which are not \p{Lu}.
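For anyone who wants to check the general-category side of this on their
own JDK, here is a minimal sketch. The class name RomanNumeralCategory and
its helper are my own, purely for illustration. It only probes the
category-based classes (\pL, \p{Lu}, \p{Nl}, \p{javaLetter}) against
U+2160, which should show that the code point is Nl and so falls outside
both Letter and Uppercase_Letter.

```java
import java.util.regex.Pattern;

// Illustrative sketch only: the class and method names are hypothetical.
final class RomanNumeralCategory {
    static final int ROMAN_NUMERAL_ONE = 0x2160;

    // True when the one-code-point string matches the given property class.
    static boolean matches(String property, int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return Pattern.matches(property, s);
    }

    public static void main(String[] args) {
        // General category Nl is what Java calls LETTER_NUMBER.
        boolean isNl = Character.getType(ROMAN_NUMERAL_ONE) == Character.LETTER_NUMBER;
        System.out.println("GC is Nl:        " + isNl);
        System.out.println("isLetter():      " + Character.isLetter(ROMAN_NUMERAL_ONE));
        System.out.println("\\pL:             " + matches("\\pL", ROMAN_NUMERAL_ONE));
        System.out.println("\\p{Lu}:          " + matches("\\p{Lu}", ROMAN_NUMERAL_ONE));
        System.out.println("\\p{Nl}:          " + matches("\\p{Nl}", ROMAN_NUMERAL_ONE));
        System.out.println("\\p{javaLetter}:  " + matches("\\p{javaLetter}", ROMAN_NUMERAL_ONE));
    }
}
```

Only the \p{Nl} probe should come back true, so any property built purely
from the L* categories cannot see this code point.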
Testing in Java with "\u2160", the \p{javaUpperCase} property fails to
match, but it ought to match if it is meant to represent the Unicode
Uppercase property. Therefore they are not the same.

Again demonstrating with U+2160, ROMAN NUMERAL ONE:

    $ perl -le 'print chr(0x2160) =~ /\p{Lu}/ || 0'
    0
    $ perl -le 'print chr(0x2160) =~ /\p{Uppercase}/ || 0'
    1

These things *do* matter. tr18 requires that Lowercase and Uppercase be
supported for Level 1 conformance. Moreover, tr44 specifically tells us
that these are *not* to be considered second-class citizens among
Unicode character properties, according to:

    http://www.unicode.org/reports/tr44/tr44-4.html#Properties

    Derived character properties are not considered second-class
    citizens among Unicode character properties. They are defined to
    make implementation of important algorithms easier to state.
    Included among the first-class derived properties important for
    such implementations are: Uppercase, Lowercase, XID_Start,
    XID_Continue, Math, and Default_Ignorable_Code_Point, all defined
    in DerivedCoreProperties.txt, as well as derived properties for the
    optimization of normalization, defined in
    DerivedNormalizationProps.txt.

BTW, I don't believe that Java supports the Unicode casing aliases,
which would let you use \p{LC} or \p{L&} as a convenient shorthand for
[\p{Lu}\p{Lt}\p{Ll}]. I don't know why it doesn't, but it would be nice
to see them supported.

============
 Whitespace
============

The Unicode Whitespace property is not the same as Java's
\p{javaWhitespace} property, contrary to your assertion. Unicode defines
25 code points as having the Whitespace property. Of these 25,
\p{javaWhitespace} fails to match code points U+85, U+A0, U+2007, and
U+202F. Therefore, one cannot use \p{javaWhitespace} to detect Unicode
whitespace.

It does not *matter* that it is documented not to match those in Java.
Because Unicode documents its Whitespace property to indeed match those,
javaWhitespace and Unicode Whitespace are not the same thing.
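The four misses are easy to reproduce. What follows is a minimal sketch,
with a class name (JavaWhitespaceGaps) and helpers that are mine and
purely illustrative. It runs each of the four code points through both
Character.isWhitespace() and the \p{javaWhitespace} regex class, and adds
U+0020 SPACE as a control that both should accept.

```java
import java.util.regex.Pattern;

// Illustrative sketch only: the class and method names are hypothetical.
final class JavaWhitespaceGaps {
    // Code points that carry the Unicode Whitespace property but which
    // the text above lists as not matched by \p{javaWhitespace}.
    static final int[] MISSED = {0x0085, 0x00A0, 0x2007, 0x202F};

    // What Character.isWhitespace() says about a code point.
    static boolean byCharacter(int codePoint) {
        return Character.isWhitespace(codePoint);
    }

    // What the regex class \p{javaWhitespace} says about a code point.
    static boolean byRegex(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return Pattern.matches("\\p{javaWhitespace}", s);
    }

    public static void main(String[] args) {
        for (int cp : MISSED) {
            System.out.printf("U+%04X  isWhitespace=%b  regex=%b%n",
                              cp, byCharacter(cp), byRegex(cp));
        }
        // Control: plain SPACE is accepted by both.
        System.out.printf("U+%04X  isWhitespace=%b  regex=%b%n",
                          0x0020, byCharacter(0x0020), byRegex(0x0020));
    }
}
```

A tokenizer that splits on \p{javaWhitespace} will therefore glue together
any words separated only by U+A0.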
This came up at work. We had Java code that tried to use the Java
whitespace property when tokenizing Unicode plain text. The corpus, the
PubMed Central Open Access set, is just *full* of U+A0, NON-BREAK SPACE.
This needs to be treated as the whitespace that Unicode says it is. We
were getting wrong answers until we threw out the Java whitespace
definition and used the Unicode one.

======================
 Namespace Collisions
======================

Sherman wrote:

> The \p{Lower/Upper/ASCII/Alpha...}, as noticed, are clearly
> specified by the Java RegEx specification[1] that are for
> US_ASCII only

Although I don't mind that ASCII should work only on ASCII, for the
others, there is a big problem. If you look in the list of official
property aliases:

    http://www.unicode.org/Public/UNIDATA/PropertyAliases.txt

You will see that those all have defined meanings in Unicode:

    Alpha ; Alphabetic
    Lower ; Lowercase
    Upper ; Uppercase

Those are completely official names. The Unicode Alpha property is by
definition identical to its Alphabetic property, its Lower the same as
its Lowercase, and its Upper the same as its Uppercase. As already
explained, these are in turn different from \pL, \p{Ll}, and \p{Lu}.

It is highly regrettable that you have used ASCII-only definitions for
those, but that is not what they are. And you cannot even get away with
claiming that you are using the POSIX compatible versions detailed in

    http://www.unicode.org/reports/tr18/#Compatibility_Properties

    The following are recommended assignments for compatibility
    property names, for use in Regular Expressions. There are two
    alternatives: the Standard Recommendation and the POSIX Compatible
    versions. Applications should use the former wherever possible. The
    latter is modified to meet the formal requirements of [POSIX], and
    also to maintain (as much as possible) compatibility with the POSIX
    usage in practice.

You cannot use the non-Standard Recommendation here, because those do
not have non-Unicode alternatives.
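To make the collision concrete, here is a minimal sketch of how the short
names behave when a pattern is compiled with no flags at all. The class
name PosixNamesAsciiOnly and the sample characters (U+00E9 and U+00C9) are
my own choices for illustration. The general-category classes \p{Ll} and
\p{Lu} are included only as a contrast.

```java
import java.util.regex.Pattern;

// Illustrative sketch only: the class and method names are hypothetical.
final class PosixNamesAsciiOnly {
    // Matches the whole string against a single property class,
    // compiled with the default (empty) flag set.
    static boolean matches(String property, String s) {
        return Pattern.compile(property).matcher(s).matches();
    }

    public static void main(String[] args) {
        String eAcuteLower = "\u00E9"; // LATIN SMALL LETTER E WITH ACUTE
        String eAcuteUpper = "\u00C9"; // LATIN CAPITAL LETTER E WITH ACUTE

        // ASCII letters are accepted by the short names.
        System.out.println("Lower/a:      " + matches("\\p{Lower}", "a"));
        System.out.println("Upper/A:      " + matches("\\p{Upper}", "A"));

        // Non-ASCII letters: short names versus general categories.
        System.out.println("Lower/U+00E9: " + matches("\\p{Lower}", eAcuteLower));
        System.out.println("Upper/U+00C9: " + matches("\\p{Upper}", eAcuteUpper));
        System.out.println("Alpha/U+00E9: " + matches("\\p{Alpha}", eAcuteLower));
        System.out.println("Ll/U+00E9:    " + matches("\\p{Ll}", eAcuteLower));
        System.out.println("Lu/U+00C9:    " + matches("\\p{Lu}", eAcuteUpper));
    }
}
```

With default flags, the three short-name probes on the accented letters
should come back false while \p{Ll} and \p{Lu} come back true, which is
exactly the opposite of what the Unicode aliases Lower, Upper, and Alpha
are defined to mean.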
While we're on it, that same tr18 section says that the Unicode Space
property must be equivalent to the Unicode Whitespace property, which we
have shown is different from the javaWhitespace property.

So Java is using names that Unicode defines, in ways that are completely
different from what Unicode says those names must mean. I find that
particularly wicked.

================
 Loose Matching
================

By the way, I find it very counterintuitive that I cannot use
javaWhiteSpace for javaWhitespace, but must use javaLowerCase, not
javaLowercase. Both sets should be allowed according to the "strongly
recommended" practice of loose matching of property names.

tr18 section 1.2 gives this "strong recommendation":

    The recommended names for UCD properties and property values are in
    PropertyAliases.txt [Prop] and PropertyValueAliases.txt [PropValue].
    There are both abbreviated names and longer, more descriptive names.
    It is strongly recommended that both names be recognized, and that
    loose matching of property names be used, whereby the case
    distinctions, whitespace, hyphens, and underbar are ignored.

And under RL1.2 it again reads:

    Note: Because it is recommended that the property syntax be lenient
    as to spaces, casing, hyphens and underbars, any of the following
    should be equivalent: \p{Lu}, \p{lu}, \p{uppercase letter},
    \p{Uppercase Letter}, \p{Uppercase_Letter}, and \p{uppercaseletter}

This is explained in more detail in 5.7 Matching Rules from tr44, which
reads in part...

    http://www.unicode.org/reports/tr44/tr44-4.html#Matching_Rules

    When matching Unicode character property names and values, it is
    strongly recommended that all Property and Property Value Aliases
    be recognized. For best results in matching, rather than using
    exact binary comparisons, the following loose matching rules should
    be observed.

    [...]

    Property aliases and property value aliases are symbolic values.
    When comparing them, use loose matching rule UAX44-LM3.

    UAX44-LM3. Ignore case, whitespace, underscore ('_'), and hyphens.

        * "linebreak" is equivalent to "Line_Break" or "Line-break"
        * "lb=BA" is equivalent to "lb=ba" or "LB=BA"

I'm sorry this is so long, but I didn't want to break it up since it is
all closely related.

Thank you for all your hard work!

--tom