Java SE 7 apparently added a (?U) flag that does the same thing as Python's (?u).
The new flag also affects Java's POSIX character class definitions such as
\p{Alnum}.
Note the difference in casing, and also that Java's (?U)\w follows UTS#18,
unlike Python's (?u)\w. Java has long supported a lowercase (?u) flag for
Unicode-aware case folding.
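
As a rough illustration of the Python side of that comparison, here is a
minimal sketch. It assumes Python 3, where str patterns are Unicode-aware by
default (so (?u) is accepted but redundant) and (?a) restricts \w to ASCII.
The helper name and the sample characters are illustrative choices, not
something taken from this thread.

```python
import re

def is_word_char(ch, flags=""):
    """Return True if the single character ch matches \\w under the given inline flags."""
    return re.fullmatch(flags + r"\w", ch) is not None

E_ACUTE = "\u00e9"          # LATIN SMALL LETTER E WITH ACUTE (category Ll)
COMBINING_ACUTE = "\u0301"  # COMBINING ACUTE ACCENT (category Mn)
UNDERTIE = "\u203f"         # UNDERTIE (category Pc)

# A non-ASCII letter matches \w under Unicode rules, but not under ASCII rules.
print(is_word_char(E_ACUTE, "(?u)"))
print(is_word_char(E_ACUTE, "(?a)"))

# UTS#18 puts \p{M} and \p{Pc} in \w; Python's \w accepts neither of these two.
print(is_word_char(COMBINING_ACUTE, "(?u)"))
print(is_word_char(UNDERTIE, "(?u)"))
```

The last two checks are where Python's \w parts ways with the UTS#18
definition: a combining mark and a connector punctuation character other than
the underscore are both rejected.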
-- Steven Levithan
-----Original Message-----
From: Steven L.
Sent: Monday, March 19, 2012 12:21 PM
To: Erik Corry
Cc: [email protected]
Subject: Re: Full Unicode based on UTF-16 proposal
Steven Levithan wrote:
\w with Unicode should match [\p{L}\p{Nd}_]. The best way to go for
[[:alnum:]], for compatibility reasons, would probably be
[\p{Ll}\p{Lu}\p{Lt}\p{Nd}]. This difference could be argued as a positive
(if you like that exact set) or a negative (many users will think it's
equivalent to \w with Unicode even though it isn't).
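
To make the gap between those two sets concrete, here is a minimal Python
sketch that approximates both classes from General_Category values via the
standard unicodedata module. The function names and sample characters are
hypothetical, chosen only for illustration, and the results depend on the
Unicode version bundled with the interpreter.

```python
import unicodedata

def in_unicode_word_class(ch):
    """Approximate [\\p{L}\\p{Nd}_] using General_Category values."""
    cat = unicodedata.category(ch)
    return cat.startswith("L") or cat == "Nd" or ch == "_"

def in_cased_alnum_class(ch):
    """Approximate [\\p{Ll}\\p{Lu}\\p{Lt}\\p{Nd}] using General_Category values."""
    return unicodedata.category(ch) in {"Ll", "Lu", "Lt", "Nd"}

# ASCII letter, accented capital, digit, underscore, CJK ideograph (Lo),
# and a modifier letter (Lm).
samples = ["a", "\u00c9", "7", "_", "\u4e2d", "\u02b0"]
for ch in samples:
    print(
        f"U+{ord(ch):04X}",
        unicodedata.category(ch),
        in_unicode_word_class(ch),
        in_cased_alnum_class(ch),
    )
```

The underscore, the ideograph, and the modifier letter fall in the first set
but not the second, which is the kind of mismatch users might not expect.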
Although some regex libraries indeed implement the above, I've just looked
over UTS#18 Annex C [1], which requires that \w be equivalent to:
[\p{Alphabetic}\p{M}\p{Nd}\p{Pc}]
Note that \p{Alphabetic} should include more than just \p{L}. I'm not clear
on whether the differences from \p{L} are fully covered by the inclusion of
\p{M} in the above character class. I'm sure there are plenty of people here
with greater Unicode expertise than me who could clarify, though.
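
One way to probe that question is to look for characters that sit outside
both \p{L} and \p{M}. The Python sketch below only checks General_Category;
the standard library does not expose the Alphabetic property, so the claim
that letter numbers (category Nl) are Alphabetic is an assumption taken from
the Unicode Character Database derivation, not something this code verifies.
The helper name is hypothetical.

```python
import unicodedata

def in_L_or_M(ch):
    """True if ch has a General_Category in the L (letter) or M (mark) groups."""
    return unicodedata.category(ch)[0] in ("L", "M")

# Assumption (from the UCD derivation, not checked here): characters with
# General_Category Nl are included in the Alphabetic property.
for ch in ("\u2160", "\u3007"):  # ROMAN NUMERAL ONE, IDEOGRAPHIC NUMBER ZERO
    print(
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        unicodedata.category(ch),
        in_L_or_M(ch),
    )
```

If that assumption holds, such characters would match \p{Alphabetic} without
being covered by [\p{L}\p{M}], so adding \p{M} alone would not close the gap.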
-- Steven Levithan
[1]: http://unicode.org/reports/tr18/#Compatibility_Properties
_______________________________________________
es-discuss mailing list
[email protected]
https://mail.mozilla.org/listinfo/es-discuss