The flag this request proposed to add is

 UNICODE_CHARSET

not the "UNICODE_UNICODE" that appeared in my last email.

My apologies for the typo.

Any suggestions for a better name? It was UNICODE_CHARACTERCLASS, but then it
became UNICODE_CHARSET, considering that it also covers UNICODE_CASE.

-Sherman

On 4/23/2011 1:00 AM, Xueming Shen wrote:
Hi,

This proposal tries to address

(1) j.u.regex does not meet the Unicode regex Simple Word Boundaries [1] requirement, as Tom
    pointed out in his email on the i18n-dev list [2]. Basically we have 3 problems here.

    a. The j.u.regex word boundary constructs \b and \B use Unicode \p{letter} + \p{digit} as
       the "word" definition, when the standard requires that the true Unicode \p{Alphabetic}
       property be used instead. They also neglect two of the specifically required characters:
           U+200C ZERO WIDTH NON-JOINER
           U+200D ZERO WIDTH JOINER
       (or the "word" could be \p{alphabetic} + \p{gc=Mark} + \p{digit} +
       \p{gc=Connector_Punctuation}, if we follow Annex C).
    b. The j.u.regex word constructs \w and \W are ASCII-only.
    c. This breaks the historical connection between word characters and word boundaries
       (because of a) and b)). For example, "élève" is NOT matched by the \b\w+\b pattern.
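
To make problem c. concrete, here is a minimal sketch of the default behavior. The class and method names are made up for illustration only; the pattern and the sample word are the ones from the text above.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the webrev.
public class AsciiWordDemo {
    // True when the whole input is matched by \b\w+\b compiled with no flags.
    static boolean isWord(String s) {
        return Pattern.compile("\\b\\w+\\b").matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isWord("eleve"));            // true: ASCII letters only
        System.out.println(isWord("\u00e9l\u00e8ve"));  // false: \w does not match the accented letters
    }
}
```

The second call fails because the default \w only covers [a-zA-Z_0-9], so \w+ can never span the accented letters, whatever \b decides at the edges.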

(2) j.u.regex does not meet the Unicode regex Properties requirement [3][5][6][7]. The main issues are

    a. Alphabetic: totally missing from the platform, not only from regex.
    b. Lowercase, Uppercase and White_Space: the Java implementation (via \p{javaMethod}) differs
       from the Unicode Standard definition.
    c. The j.u.regex POSIX character classes are ASCII-only, while the standard has a Unicode
       version defined at tr#18 Annex C [3].

As the solution, I propose to

(1) add a flag UNICODE_UNICODE to
    a) flip the ASCII-only predefined character classes (\b \B \w \W \d \D \s \S) and POSIX
       character classes (\p{alpha}, \p{lower}, \p{upper}...) to their Unicode versions
    b) enable UNICODE_CASE (anything Unicode)

    Ideally we would just evolve/upgrade the Java regex from the aged "ascii-only" to Unicode
    (maybe adding an OLD_ASCII_ONLY_POSIX as a fallback :-)), like what Perl did. But given
    Java's "compatibility" spirit (and the performance concern as well), this is unlikely to
    happen.
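
As a rough sketch of what such a flag changes: the example below uses Pattern.UNICODE_CHARACTER_CLASS, which is the name the flag carries in released JDKs (7 and later), not the name proposed in this mail. The demo class and helper are hypothetical, and the sample inputs are my own picks.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; only java.util.regex.Pattern is real API.
public class UnicodeFlagDemo {
    // Compile the regex with the given flags and test it against the whole input.
    static boolean matches(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        int u = Pattern.UNICODE_CHARACTER_CLASS; // released name of the flag
        int ci = Pattern.CASE_INSENSITIVE;
        String eleve = "\u00e9l\u00e8ve";        // "élève"

        // a) predefined and POSIX classes flip from ASCII-only to Unicode
        System.out.println(matches("\\b\\w+\\b", 0, eleve));       // false
        System.out.println(matches("\\b\\w+\\b", u, eleve));       // true
        System.out.println(matches("\\d", 0, "\u0661"));           // false (ARABIC-INDIC DIGIT ONE)
        System.out.println(matches("\\d", u, "\u0661"));           // true
        System.out.println(matches("\\p{Lower}", 0, "\u00e9"));    // false
        System.out.println(matches("\\p{Lower}", u, "\u00e9"));    // true

        // b) the flag implies UNICODE_CASE once CASE_INSENSITIVE is on
        System.out.println(matches("\u00c9", ci, "\u00e9"));       // false: ASCII-only folding
        System.out.println(matches("\u00c9", ci | u, "\u00e9"));   // true
    }
}
```

Keeping this behind a flag means existing patterns keep their ASCII-only semantics (and cost) unless the caller opts in.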

(2) add \p{IsBinaryProperty} to explicitly support some important Unicode binary properties,
    such as \p{IsAlphabetic}, \p{IsIdeographic}, \p{IsPunctuation}... With this, j.u.regex can
    easily access some properties that are either not provided by j.l.Character directly, or
    for which j.l.Character has a different version (for example White_Space).
    (The missing Alphabetic and the different Uppercase/Lowercase issues have been/are being
    addressed at Cr#7037261 [4], any reviewer?)
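
A short sketch of the \p{IsBinaryProperty} syntax, as it behaves in released JDKs (7 and later). The demo class is hypothetical, and the NO-BREAK SPACE case is my own pick to show where j.l.Character and the Unicode White_Space property disagree.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; only java.util.regex.Pattern is real API.
public class BinaryPropertyDemo {
    // True when the whole input matches the given regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matches("\\p{IsAlphabetic}", "\u00e9"));   // true: accented letter
        System.out.println(matches("\\p{IsAlphabetic}", "1"));        // false: digit
        System.out.println(matches("\\p{IsIdeographic}", "\u4e2d"));  // true: CJK ideograph

        // U+00A0 NO-BREAK SPACE: Character.isWhitespace() excludes it,
        // while the Unicode White_Space property includes it.
        System.out.println(matches("\\p{javaWhitespace}", "\u00a0")); // false
        System.out.println(matches("\\p{IsWhite_Space}", "\u00a0"));  // true
    }
}
```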

The webrev is at
http://cr.openjdk.java.net/~sherman/7039066/webrev/

The corresponding updated j.u.regex.Pattern API doc is at
http://cr.openjdk.java.net/~sherman/7039066/Pattern.html

Specdiff result is at
http://cr.openjdk.java.net/~sherman/7039066/specdiff/diff.html

I will file the CCC request if the API change proposal in the webrev is approved. This is coming in very late, so it may be held back until Java 8 if it cannot make the cutoff for jdk7.

-Sherman


[1] http://www.unicode.org/reports/tr18/#Simple_Word_Boundaries
[2] http://mail.openjdk.java.net/pipermail/i18n-dev/2011-January/000256.html
[3] http://www.unicode.org/reports/tr18/#Compatibility_Properties
[4] http://mail.openjdk.java.net/pipermail/i18n-dev/2011-April/000370.html
[5] http://mail.openjdk.java.net/pipermail/i18n-dev/2011-January/000249.html
[6] http://mail.openjdk.java.net/pipermail/i18n-dev/2011-January/000253.html
[7] http://mail.openjdk.java.net/pipermail/i18n-dev/2011-January/000254.html
