Hi,
I just noticed that the webrev URL was pointing to the blenderrev. The
correct webrev is at
http://cr.openjdk.java.net/~sherman/7014640/webrev
Btw, this one has been approved by CCC.
thanks,
-Sherman
On 04/21/2012 12:56 AM, Xueming Shen wrote:
Hi
Here are the webrev and blenderrev for the proposed change to add 5
new regex constructs: \R \h \H \v \V.
\R: recommended by Unicode Regex TR#18 for matching all line ending
characters and sequences, is equivalent to
( \u000D\u000A | [\u000A\u000B\u000C\u000D\u0085\u2028\u2029] )
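As a rough check of the equivalence stated above, here is a minimal
sketch that compares \R with the expanded pattern on a few sample
inputs. It assumes a JDK that already includes this change; the class
and method names are made up for illustration.

```java
import java.util.regex.Pattern;

class LinebreakEquivalence {
    // The expansion quoted above, written without spaces because COMMENTS mode is off.
    static final Pattern EXPANDED = Pattern.compile(
        "\\u000D\\u000A|[\\u000A\\u000B\\u000C\\u000D\\u0085\\u2028\\u2029]");
    static final Pattern LINEBREAK = Pattern.compile("\\R");

    // True when both patterns give the same whole-input match result.
    static boolean agree(String input) {
        return LINEBREAK.matcher(input).matches() == EXPANDED.matcher(input).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"\r\n", "\n", "\u000B", "\f", "\r", "\u0085", "\u2028", "\u2029", "a", "\n\r"};
        for (String s : samples) {
            System.out.println(agree(s));
        }
    }
}
```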
\h, \v, \H and \V:
\h and \v match any character considered to be horizontal or vertical
whitespace, respectively; \H and \V match any character that is not.
Webrev:
http://cr.openjdk.java.net/~sherman/7014640/blenderrev.html
Blenderrev:
http://cr.openjdk.java.net/~sherman/7014640/blenderrev.html
New Pattern API:
http://cr.openjdk.java.net/~sherman/7014640/Pattern.html
Here are a couple of notes regarding the spec/implementation.
(1) \v was previously implemented as \u000B ('\013') but never
documented (it did not appear in our API doc as a supported construct
alongside \t \r \n...). Defining \v as a "general" construct for all
vertical whitespace characters might raise some compatibility concerns,
since \v now matches more characters. But given that this was an
undocumented implementation detail and \u000B is still matched by \v,
I consider this an acceptable behavior change.
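To make the behavior change concrete, a small sketch, assuming a JDK
with this change applied (the class and helper names are hypothetical):
VT is still matched, and other vertical whitespace such as \n now
matches too.

```java
import java.util.regex.Pattern;

class VerticalWhitespaceCheck {
    static final Pattern VERTICAL = Pattern.compile("\\v");

    // True when the whole input is a single vertical whitespace character.
    static boolean isVertical(String s) {
        return VERTICAL.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isVertical("\u000B")); // VT, the only character matched before
        System.out.println(isVertical("\n"));     // newly matched by the general construct
        System.out.println(isVertical("\u2028")); // LINE SEPARATOR, also newly matched
        System.out.println(isVertical(" "));      // a space is horizontal, not vertical
    }
}
```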
(2) A predefined character class can appear inside another character
class; for example, you can have [...\v...]. However, because it
represents a "class" of characters, it can't be the start or end code
point of a range inside a class: you can have [a-b], but you can't
have [\h-...] or [...-\h] (an exception will be thrown). For \v,
though, since it was implemented as \u000B (VT), you were able to use
it as the start or end value of a range. I feel it is better to keep
that working the way it did before, so [\v-\v] works and will match
VT in this implementation.
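A small probe for the range handling described above, assuming a JDK
with this change. The helper name compiles is made up. What happens
for the \h range forms depends on the implementation, so the sketch
only reports it rather than asserting a result.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class ClassRangeProbe {
    // Reports whether the regex is accepted by Pattern.compile.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // A predefined class nested inside a character class.
        System.out.println("\n".matches("[x\\vy]"));
        // As a range endpoint, the escape keeps its old meaning of VT only.
        System.out.println("\u000B".matches("[\\v-\\v]"));
        System.out.println("\n".matches("[\\v-\\v]"));
        // Range forms using the horizontal class; results are printed, not assumed.
        System.out.println(compiles("[\\h-z]"));
        System.out.println(compiles("[a-\\h]"));
    }
}
```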
(3) The newly added \h\v\H\V constructs are all "Unicode versions" of
character classes, while the rest of the "predefined character class"
family (\d\D\s\S\w\W) is ASCII only; you have to turn on the
UNICODE_CHARACTER_CLASS flag to get the Unicode version of those
constructs. This is kind of inconsistent. Perl's corresponding
constructs work in a similar way, though: all of Perl's
\d\D\s\S\w\W\v\V\h\H are Unicode by default, and its /a modifier turns
\d\D\s\S\w\W back to ASCII mode but does not affect \h\v\H\V. We had
the discussion back in JDK7 regarding ASCII vs Unicode for these
constructs; the decision then was to keep the predefined character
classes (and POSIX character classes) ASCII by default and to provide
the UNICODE_CHARACTER_CLASS flag to turn them into the Unicode version.
Given that there is NOT an ASCII version of \h\v\H\V in Perl, and we
never had an ASCII version in Java regex that would raise a
compatibility concern, I feel it is better to just have a simple
Unicode version of \h\v\H\V.
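The difference in defaults can be seen with a non-ASCII space such as
U+2003 (EM SPACE). This is only a sketch, assuming a JDK with this
change; the class and helper names are hypothetical.

```java
import java.util.regex.Pattern;

class UnicodeClassDefaults {
    // Whole-input match of a regex compiled with the given flags.
    static boolean matches(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String emSpace = "\u2003";
        // The horizontal class is Unicode-aware without any flag.
        System.out.println(matches("\\h", 0, emSpace));
        // The whitespace class is ASCII-only by default...
        System.out.println(matches("\\s", 0, emSpace));
        // ...and becomes Unicode-aware only with the flag.
        System.out.println(matches("\\s", Pattern.UNICODE_CHARACTER_CLASS, emSpace));
    }
}
```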
(4) \R is not a character class, since it matches the two-character
sequence \r\n as a single unit.
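A quick way to see the difference, again assuming a JDK with this
change (countMatches is a made-up helper): on a CRLF, \R finds one
match covering both characters, while the character class \v finds
two single-character matches.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinebreakSequence {
    // Counts successive non-overlapping matches of the regex in the input.
    static int countMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "a\r\nb";
        System.out.println(countMatches("\\R", text)); // CRLF consumed as one line ending
        System.out.println(countMatches("\\v", text)); // CR and LF matched separately
        System.out.println(text.split("\\R").length);  // number of lines after splitting
    }
}
```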
This one will need to go through the CCC process.
Thanks,
-Sherman