Re: Codereview Request: 7039066 j.u.rgex does not match TR#18 RL1.4 Simple Word Boundaries and RL1.2 Properties

Xueming Shen Sat, 23 Apr 2011 01:12:39 -0700

 The flag this request proposed to add is

 UNICODE_CHARSET


not the "UNICODE_UNICODE" in my last email.

My apologies for the typo.

Any suggestions for a better name? It was UNICODE_CHARACTERCLASS, but then it
became UNICODE_CHARSET, considering the UNICODE_CASE.

-Sherman

On 4/23/2011 1:00 AM, Xueming Shen wrote:

 Hi

This proposal tries to address
(1) j.u.regex does not meet Unicode regex's Simple Word Boundaries [1] requirement, as Tom pointed
    out in his email on the i18n-dev list [2]. Basically we have 3 problems here.
    a. j.u.regex's word boundary constructs \b and \B use Unicode \p{letter} + \p{digit} as the
        "word" definition, when the standard requires that the true Unicode \p{Alphabetic}
        property be used instead. It also neglects two of the specifically required characters:
        U+200C ZERO WIDTH NON-JOINER
        U+200D ZERO WIDTH JOINER
        (or the "word" could be \p{alphabetic} + \p{gc=Mark} + \p{digit} + \p{gc=Connector_Punctuation},
        if we follow Annex C).
    b. j.u.regex's word constructs \w and \W are ASCII-only versions.
    c. It breaks the historical connection between word characters and word boundaries (because of
        a) and b)). For example, "élève" is NOT matched by the \b\w+\b pattern.
(2) j.u.regex does not meet Unicode regex's Properties requirement [3][5][6][7]. The main issues are
    a. Alphabetic: totally missing from the platform, not only regex.
    b. Lowercase, Uppercase and White_Space: the Java implementation (via \p{javaMethod}) differs
        from the Unicode Standard definition.
    c. j.u.regex's POSIX character classes are ASCII only, when the standard has a Unicode version
        defined at tr#18 Annex C [3].
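
The ASCII-only behaviour described in (1)b, (1)c and (2)c can be checked with a minimal sketch like the one below. The class AsciiOnlyDemo and its helper method are hypothetical names used only for illustration; the only real API involved is Pattern.matches and Character.isLetter. The sketch uses whole-string matching so that the result depends only on what \w and \p{Lower} accept.

```java
import java.util.regex.Pattern;

// Hypothetical demo class for illustration; not part of the webrev.
class AsciiOnlyDemo {
    // "élève", written with escapes so the source file encoding does not matter.
    static final String WORD = "\u00e9l\u00e8ve";

    // True if the whole input matches the regex compiled without any flags.
    static boolean matchesByDefault(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // \w is [a-zA-Z_0-9] by default, so the accented letters are rejected.
        System.out.println(matchesByDefault("\\b\\w+\\b", WORD));   // false
        // The POSIX class \p{Lower} is [a-z] by default.
        System.out.println(matchesByDefault("\\p{Lower}+", WORD));  // false
        // j.l.Character, by contrast, treats U+00E9 as a letter.
        System.out.println(Character.isLetter(WORD.charAt(0)));     // true
    }
}
```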

As the solution, I propose to

(1) add a flag UNICODE_UNICODE to
    a) flip the ASCII-only predefined character classes (\b \B \w \W \d \D \s \S) and POSIX character
        classes (\p{alpha}, \p{lower}, \p{upper}...) to their Unicode versions
    b) enable the UNICODE_CASE (anything Unicode)
    Ideally we would like to just evolve/upgrade the Java regex from the aged "ASCII-only" to
    Unicode (maybe adding an OLD_ASCII_ONLY_POSIX as a fallback :-)), like what Perl did. But given
    Java's "compatibility" spirit (and the performance concern as well), this is unlikely to
    happen.
(2) add \p{IsBinaryProperty} to explicitly support some important Unicode binary properties, such
    as \p{IsAlphabetic}, \p{IsIdeographic}, \p{IsPunctuation}... With this, j.u.regex can easily
    access some properties that are either not provided by j.l.Character directly or for which
    j.l.Character has a different version (for example White_Space).
    (The missing Alphabetic and the different uppercase/lowercase issues have been/are being
    addressed at Cr#7037261 [4], any reviewer?)
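
A rough sketch of how the two proposed additions behave from user code follows. One assumption to flag loudly: the constant used below, Pattern.UNICODE_CHARACTER_CLASS, is the name under which such a flag exists in java.util.regex.Pattern since Java 7, not either of the names floated in this thread. The class UnicodeFlagDemo and its helper methods are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical demo class for illustration; not part of the webrev.
class UnicodeFlagDemo {
    // "élève", written with escapes so the source file encoding does not matter.
    static final String WORD = "\u00e9l\u00e8ve";
    // U+00A0 NO-BREAK SPACE, which has the Unicode White_Space property.
    static final String NBSP = "\u00a0";

    // Whole-string match with the Unicode versions of \b, \w, \p{Lower}, ...
    // Assumption: UNICODE_CHARACTER_CLASS is the Java 7+ name of the flag.
    static boolean matchesWithUnicodeClasses(String regex, String input) {
        return Pattern.compile(regex, Pattern.UNICODE_CHARACTER_CLASS)
                      .matcher(input)
                      .matches();
    }

    // Whole-string match with no flags set.
    static boolean matchesByDefault(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // (1) With the flag, \w and the POSIX classes follow Unicode.
        System.out.println(matchesWithUnicodeClasses("\\b\\w+\\b", WORD));   // true
        System.out.println(matchesWithUnicodeClasses("\\p{Lower}+", WORD));  // true

        // (2) Binary properties are available without any flag.
        System.out.println(matchesByDefault("\\p{IsAlphabetic}+", WORD));    // true
        System.out.println(matchesByDefault("\\p{IsWhite_Space}", NBSP));    // true

        // j.l.Character's notion of whitespace excludes non-breaking spaces.
        System.out.println(Character.isWhitespace(NBSP.charAt(0)));          // false
    }
}
```

The last two lines illustrate the White_Space difference mentioned in (2): the regex property follows the Unicode definition, while Character.isWhitespace does not.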

The webrev is at
http://cr.openjdk.java.net/~sherman/7039066/webrev/

The corresponding updated j.u.regex.Pattern API doc is at
http://cr.openjdk.java.net/~sherman/7039066/Pattern.html

Specdiff result is at
http://cr.openjdk.java.net/~sherman/7039066/specdiff/diff.html

I will file the CCC request if the API change proposal in the webrev is approved. This is coming in
very late, so it is possible that it may be held back until Java 8 if it cannot make the cutoff
for jdk7.

-Sherman


[1] http://www.unicode.org/reports/tr18/#Simple_Word_Boundaries
[2] http://mail.openjdk.java.net/pipermail/i18n-dev/2011-January/000256.html
[3] http://www.unicode.org/reports/tr18/#Compatibility_Properties
[4] http://mail.openjdk.java.net/pipermail/i18n-dev/2011-April/000370.html
[5] http://mail.openjdk.java.net/pipermail/i18n-dev/2011-January/000249.html
[6] http://mail.openjdk.java.net/pipermail/i18n-dev/2011-January/000253.html
[7] http://mail.openjdk.java.net/pipermail/i18n-dev/2011-January/000254.html