Re: Codereview request for 7014640: To add a metachar \R for line ending and character classes for vertical/horizontal ws \v \V \h \H

Xueming Shen Tue, 01 May 2012 11:07:37 -0700

Hi,

Just noticed that the webrev URL was pointing to the blenderrev. The webrev is at


http://cr.openjdk.java.net/~sherman/7014640/webrev

Btw, this one has been approved by CCC.

thanks,
-Sherman

On 04/21/2012 12:56 AM, Xueming Shen wrote:

Hi
Here are the webrev and blenderrev for the proposed change to add 5 new regex constructs: \R \v \V \h \H.
\R:  recommended by Unicode Regex TR#18 for matching all line ending
    characters and sequences, is equivalent to
    ( \u000D\u000A | [\u000A\u000B\u000C\u000D\u0085\u2028\u2029] )
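
A minimal sketch of the equivalence above, assuming a JDK in which this change is present (Java 8 or later). The class and method names are hypothetical and only serve the illustration; the expanded pattern is the one quoted above, written as a Java string.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the webrev.
class LineBreakDemo {
    // The proposed construct.
    static final Pattern LINEBREAK = Pattern.compile("\\R");

    // The expansion quoted in the mail: CR LF as a pair, or any single
    // line-ending character (LF, VT, FF, CR, NEL, LS, PS).
    static final Pattern EXPANDED = Pattern.compile(
        "\\u000D\\u000A|[\\u000A\\u000B\\u000C\\u000D\\u0085\\u2028\\u2029]");

    // True when both patterns give the same answer for a whole-input match.
    static boolean agree(String input) {
        return LINEBREAK.matcher(input).matches()
            == EXPANDED.matcher(input).matches();
    }

    // Splits text on any line ending; CR LF counts as one separator.
    static String[] lines(String text) {
        return LINEBREAK.split(text);
    }

    public static void main(String[] args) {
        String[] samples = {"\r\n", "\n", "\r", "\u000B", "\f",
                            "\u0085", "\u2028", "\u2029", " ", "x"};
        for (String s : samples) {
            System.out.println(agree(s));
        }
        System.out.println(Arrays.toString(lines("a\r\nb\nc\u2028d")));
    }
}
```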

\h, \v, \H and \V:
match any character considered to (not) be horizontal/vertical whitespace.
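
As a rough illustration of the four classes, assuming a JDK with this change (Java 8 or later); the class and helper names are made up for the example:

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the webrev.
class WhitespaceClassDemo {
    // True if s is exactly one horizontal whitespace character.
    static boolean isHorizontal(String s) {
        return Pattern.matches("\\h", s);
    }

    // True if s is exactly one vertical whitespace character.
    static boolean isVertical(String s) {
        return Pattern.matches("\\v", s);
    }

    // Collapses each run of horizontal whitespace into a single space,
    // leaving line endings untouched.
    static String collapseHorizontal(String s) {
        return s.replaceAll("\\h+", " ");
    }

    public static void main(String[] args) {
        System.out.println(isHorizontal("\t"));             // true
        System.out.println(isHorizontal("\n"));             // false
        System.out.println(isVertical("\n"));               // true
        System.out.println(isVertical(" "));                // false
        System.out.println(Pattern.matches("\\H", "\n"));   // true: not horizontal
        System.out.println(Pattern.matches("\\V", " "));    // true: not vertical
        System.out.println(collapseHorizontal("a \t\u00A0b\nc"));
    }
}
```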
Webrev:
http://cr.openjdk.java.net/~sherman/7014640/blenderrev.html

Blenderrev:
http://cr.openjdk.java.net/~sherman/7014640/blenderrev.html

new Pattern api
http://cr.openjdk.java.net/~sherman/7014640/Pattern.html

Here are a couple of notes regarding the spec/implementation.
(1) \v was implemented as \u000B ('\013'), but not documented (it did not
appear in our API doc as a supported construct, unlike \t \r \n ...).
Defining \v as a "general" construct for all vertical whitespace characters
might trigger some compatibility concerns (more characters are now matched
by \v). But given that this is a never-documented implementation detail and
\u000B is still matched by \v, I would consider this an acceptable behavior
change.
(2) A predefined character class can appear inside another character class;
for example, you can have [...\v...]. However, since it represents a "class"
of characters, it can't be the start or end code point of a range inside a
class: you can have [a-b], but you can't have [\h-...] or [...-\h] (an
exception will be thrown). \v, however, was implemented as \u000B (VT), so
you were able to use it as the start or end value of a range. I feel it'd be
better to keep it working the way it did before, so [\v-\v] works and will
match the VT in this implementation.
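
A small sketch of the range behavior described in (2), assuming a JDK with this change (Java 8 or later). The class and helper names are hypothetical, and only the end-of-range case [a-\h] is exercised for the rejected form.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical demo class, not part of the webrev.
class RangeEndpointDemo {
    // Reports whether the regex compiles without a PatternSyntaxException.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    // As a range endpoint, \v keeps its old meaning of the single VT
    // character (0x0B).
    static boolean vtRangeMatches(String input) {
        return Pattern.matches("[\\v-\\v]", input);
    }

    public static void main(String[] args) {
        // Outside a range, \v inside a class is the whole vertical-ws class.
        System.out.println(Pattern.matches("[a\\vb]", "\n"));
        // As both endpoints of a range, \v is just VT.
        System.out.println(vtRangeMatches("\u000B"));
        System.out.println(vtRangeMatches("\n"));
        // A class such as \h cannot end a range.
        System.out.println(compiles("[a-\\h]"));
    }
}
```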
(3) The newly added \h \v \H \V constructs are all "Unicode version"
character classes, while the rest of the "predefined character class" family
(\d\D\s\S\w\W) are ASCII only; you have to turn on the flag
UNICODE_CHARACTER_CLASS to get the Unicode version of those constructs. This
is kind of inconsistent. Perl's corresponding constructs work in a similar
way: all of Perl's \d\D\s\S\w\W\v\V\h\H work in Unicode mode, and there is
an /a modifier to turn \d\D\s\S\w\W back to ASCII mode, but not \h\v\H\V. We
had the discussion back in JDK7 regarding ASCII vs Unicode for these
constructs; the decision then was to keep these predefined character classes
(and POSIX character classes) ASCII by default, and to have a flag,
UNICODE_CHARACTER_CLASS, to turn them into the Unicode version. Given there
is NOT an ASCII version in Perl, and we didn't have an ASCII version in Java
regex that could trigger compatibility concerns, I feel it might be better
to just have a simple Unicode version of \h\v\H\V.
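
To make the inconsistency in (3) concrete, here is a minimal sketch assuming a JDK with this change (Java 8 or later). It uses U+2003 (EM SPACE) as a non-ASCII whitespace sample; the class and helper names are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the webrev.
class UnicodeDefaultDemo {
    // A horizontal whitespace character outside ASCII.
    static final String EM_SPACE = "\u2003";

    // Whole-input match of regex against input under the given flags.
    static boolean matches(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        // \h is Unicode-aware with no flag at all.
        System.out.println(matches("\\h", 0, EM_SPACE));    // true
        // \s is ASCII-only by default ...
        System.out.println(matches("\\s", 0, EM_SPACE));    // false
        // ... and needs the flag to cover Unicode whitespace.
        System.out.println(
            matches("\\s", Pattern.UNICODE_CHARACTER_CLASS, EM_SPACE)); // true
    }
}
```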

(4) \R is not a character class, since it matches the two-character sequence \r\n.
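
A short sketch of what (4) means in practice, assuming a JDK with this change (Java 8 or later); the names are hypothetical. A single \R match can consume two characters, and \R is not accepted inside [...], whereas \v is.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical demo class, not part of the webrev.
class LinebreakNotAClassDemo {
    // Number of chars consumed by one \R match at the start of input,
    // or -1 if the input does not start with a line ending.
    static int matchLength(String input) {
        Matcher m = Pattern.compile("\\R").matcher(input);
        return m.lookingAt() ? m.end() : -1;
    }

    // Reports whether the regex compiles without a PatternSyntaxException.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(matchLength("\r\nrest")); // 2: CR LF is one match
        System.out.println(matchLength("\nrest"));   // 1
        System.out.println(compiles("[\\v]"));       // true: a character class
        System.out.println(compiles("[\\R]"));       // false: not a class
    }
}
```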

This one will need to go through the CCC process.

Thanks,
-Sherman
