Java SE 7 apparently added a (?U) flag that does the same thing as Python's (?u). The new flag also affects Java's POSIX character class definitions, such as \p{Alnum}.

Note the difference in casing, and also that Java's (?U)\w follows UTS#18, unlike Python's (?u)\w. Java has long supported a lowercase (?u) flag for Unicode-aware case folding.
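
For anyone who wants to see the Python side of that difference, here is a minimal sketch. It assumes Python 3, where str patterns are Unicode-aware by default (so (?u) is implied) and re.ASCII restricts \w to [a-zA-Z0-9_]. The sample strings are my own, not from any spec.

```python
import re

# Python 3: str patterns are Unicode-aware by default, so (?u) is implied.
unicode_word = re.compile(r"\w+")
# re.ASCII (inline: (?a)) restricts \w to [a-zA-Z0-9_].
ascii_word = re.compile(r"\w+", re.ASCII)

# Precomposed i-diaeresis (U+00EF) and e-acute (U+00E9).
text = "na\u00efve caf\u00e9"
print(unicode_word.findall(text))  # ['naïve', 'café']
print(ascii_word.findall(text))    # ['na', 've', 'caf']

# Decomposed form: "e" followed by U+0301 COMBINING ACUTE ACCENT (category Mn).
# Python's \w does not match the combining mark, so the match stops after "e".
decomposed = "cafe\u0301"
print(unicode_word.findall(decomposed))  # ['cafe']
```

The last case is where Python parts ways with UTS#18, whose \w definition includes \p{M} and would therefore keep the accent with the word.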

-- Steven Levithan


-----Original Message-----
From: Steven L.
Sent: Monday, March 19, 2012 12:21 PM
To: Erik Corry
Cc: [email protected]
Subject: Re: Full Unicode based on UTF-16 proposal

Steven Levithan wrote:
\w with Unicode should match [\p{L}\p{Nd}_]. The best way to go for [[:alnum:]], for compatibility reasons, would probably be [\p{Ll}\p{Lu}\p{Lt}\p{Nd}]. This difference could be argued as a positive (if you like that exact set) or a negative (many users will think it's equivalent to \w with Unicode even though it isn't).

Although some regex libraries indeed implement the above, I've just looked
over UTS#18 Annex C [1], which requires that \w be equivalent to:

[\p{Alphabetic}\p{M}\p{Nd}\p{Pc}]

Note that \p{Alphabetic} should include more than just \p{L}. I'm not clear
on whether the differences from \p{L} are fully covered by the inclusion of
\p{M} in the above character class. I'm sure there are plenty of people here
with greater Unicode expertise than me who could clarify, though.
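
As a rough way to poke at the question, here is a sketch that approximates the two character classes with general categories only. The helper names are hypothetical, and Python's standard library does not expose the Alphabetic property, so the second helper substitutes \p{L} for \p{Alphabetic} and is not a faithful UTS#18 implementation.

```python
import unicodedata

def in_l_nd_underscore(ch):
    """Approximates [\\p{L}\\p{Nd}_] using general categories."""
    cat = unicodedata.category(ch)
    return cat.startswith("L") or cat == "Nd" or ch == "_"

def in_l_m_nd_pc(ch):
    """Approximates [\\p{L}\\p{M}\\p{Nd}\\p{Pc}], i.e. the UTS#18 set
    with \\p{L} standing in for \\p{Alphabetic} (an assumption)."""
    cat = unicodedata.category(ch)
    return cat[0] in "LM" or cat in ("Nd", "Pc")

samples = ["a", "\u0301", "\u203f", "\u2160"]
for ch in samples:
    print(
        f"U+{ord(ch):04X}",
        unicodedata.category(ch),
        in_l_nd_underscore(ch),
        in_l_m_nd_pc(ch),
    )
```

U+2160 ROMAN NUMERAL ONE has general category Nl, so neither helper accepts it. As far as I can tell from DerivedCoreProperties, Alphabetic is derived from Lowercase + Uppercase + Lt + Lm + Lo + Nl + Other_Alphabetic, which would mean adding \p{M} to \p{L} does not cover the whole difference.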

-- Steven Levithan

[1]: http://unicode.org/reports/tr18/#Compatibility_Properties

_______________________________________________
es-discuss mailing list
[email protected]
https://mail.mozilla.org/listinfo/es-discuss
