The changes sound good. The flag UNICODE_CHARSET will be misleading, since all of Java uses the Unicode Charset (= encoding). How about:
UNICODE_SPEC or something that gives that flavor.

Mark

*— Il meglio è l’inimico del bene —*

On Sat, Apr 23, 2011 at 01:12, Xueming Shen <xueming.s...@oracle.com> wrote:

> The flag this request proposed to add is
>
>     UNICODE_CHARSET
>
> not the "UNICODE_UNICODE" in last email.
>
> My apology for the typo.
>
> Any suggestion for a better name? It was UNICODE_CHARACTERCLASS, but then
> it became UNICODE_CHARSET, considering the unicode_case.
>
> -Sherman
>
>
> On 4/23/2011 1:00 AM, Xueming Shen wrote:
>
>> Hi
>>
>> This proposal tries to address
>>
>> (1) j.u.regex does not meet Unicode regex's Simple Word Boundaries [1]
>>     requirement as Tom pointed out in his email on i18n-dev list [2].
>>     Basically we have 3 problems here.
>>
>>     a. j.u.regex word boundary construct \b and \B uses Unicode
>>        \p{letter} + \p{digit} as the "word" definition when the standard
>>        requires the true Unicode \p{Alphabetic} property be used instead.
>>        It also neglects two of the specifically required characters:
>>            U+200C ZERO WIDTH NON-JOINER
>>            U+200D ZERO WIDTH JOINER
>>        (or the "word" could be \p{alphabetic} + \p{gc=Mark} + \p{digit} +
>>        \p{gc=Connector_Punctuation}, if we follow Annex C).
>>     b. j.u.regex's word construct \w and \W are ASCII only version
>>     c. It breaks the historical connection between word characters and
>>        word boundaries (because of a) and b). For example "élève" is NOT
>>        matched by the \b\w+\b pattern)
>>
>> (2) j.u.regex does not meet Unicode regex's Properties requirement
>>     [3][5][6][7]. The main issues are
>>
>>     a. Alphabetic: totally missing from the platform, not only regex
>>     b. Lowercase, Uppercase and White_Space: Java implementation (via
>>        \p{javaMethod}) is different compared to Unicode Standard
>>        definition.
>>     c. j.u.regex's POSIX character classes are ASCII only, when standard
>>        has a Unicode version defined at tr#18 Annex C [3]
>>
>> As the solution, I propose to
>>
>> (1) add a flag UNICODE_UNICODE to
>>     a) flip the ASCII only predefined character classes (\b \B \w \W \d
>>        \D \s \S) and POSIX character classes (\p{alpha}, \p{lower},
>>        \p{upper}...) to Unicode version
>>     b) enable the UNICODE_CASE (anything Unicode)
>>
>>     While ideally we would like to just evolve/upgrade the Java regex
>>     from the aged "ascii-only" to unicode (maybe add a
>>     OLD_ASCII_ONLY_POSIX as a fallback:-)), like what Perl did. But given
>>     the Java's "compatibility" spirit (and the performance concern as
>>     well), this is unlikely to happen.
>>
>> (2) add \p{IsBinaryProperty} to explicitly support some important Unicode
>>     binary properties, such as \p{IsAlphabetic}, \p{IsIdeographic},
>>     \p{IsPunctuation}...with this j.u.regex can easily access some
>>     properties that are either not provided by j.l.Character directly or
>>     j.l.Character has a different version (for example the White_Space).
>>     (The missing alphabetic, different uppercase/lowercase issue has
>>     been/is being addressed at Cr#7037261 [4], any reviewer?)
>>
>> The webrev is at
>> http://cr.openjdk.java.net/~sherman/7039066/webrev/
>>
>> The corresponding updated api j.u.regex.Pattern API doc is at
>> http://cr.openjdk.java.net/~sherman/7039066/Pattern.html
>>
>> Specdiff result is at
>> http://cr.openjdk.java.net/~sherman/7039066/specdiff/diff.html
>>
>> I will file the CCC request if the API change proposal in webrev is
>> approved. This is coming in very late so it is possible that it may be
>> held back until Java 8, if it can not make the cutoff for jdk7.
>>
>> -Sherman
>>
>>
>> [1] http://www.unicode.org/reports/tr18/#Simple_Word_Boundaries
>> [2] http://mail.openjdk.java.net/pipermail/i18n-dev/2011-January/000256.html
>> [3] http://www.unicode.org/reports/tr18/#Compatibility_Properties
>> [4] http://mail.openjdk.java.net/pipermail/i18n-dev/2011-April/000370.html
>> [5] http://mail.openjdk.java.net/pipermail/i18n-dev/2011-January/000249.html
>> [6] http://mail.openjdk.java.net/pipermail/i18n-dev/2011-January/000253.html
>> [7] http://mail.openjdk.java.net/pipermail/i18n-dev/2011-January/000254.html
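
For anyone who wants to reproduce problem (1)c from the quoted proposal, a minimal sketch follows. It rests on assumptions not stated in this thread: the class and method names are made up for illustration, and it uses Pattern.UNICODE_CHARACTER_CLASS, the name under which a flag of this kind is available in JDK 7 and later, rather than any of the names being discussed here. It also checks \p{IsAlphabetic}, one of the binary properties from item (2) of the proposed solution.

```java
import java.util.regex.Pattern;

// Illustrative only: the class and method names are hypothetical.
final class UnicodeWordDemo {

    // Whole-string match of \b\w+\b under the given Pattern flags.
    static boolean isSingleWord(String text, int flags) {
        return Pattern.compile("\\b\\w+\\b", flags).matcher(text).matches();
    }

    // True if every character has the Unicode Alphabetic binary property.
    static boolean isAllAlphabetic(String text) {
        return Pattern.matches("\\p{IsAlphabetic}+", text);
    }

    public static void main(String[] args) {
        // The example word from the proposal, written with Unicode escapes
        // so the result does not depend on the source file encoding.
        String word = "\u00e9l\u00e8ve";

        // Default flags: \w is ASCII-only, so the accented letters fail.
        System.out.println(isSingleWord(word, 0)); // false

        // Assumption: UNICODE_CHARACTER_CLASS (JDK 7+) stands in for the
        // flag proposed in this thread.
        System.out.println(isSingleWord(word, Pattern.UNICODE_CHARACTER_CLASS)); // true

        // Binary property syntax from item (2) of the proposed solution.
        System.out.println(isAllAlphabetic(word)); // true
    }
}
```

The sketch uses matches() rather than find() on purpose: a whole-string match fails under default flags as soon as \w rejects the first accented letter, so the result should not hinge on exactly how \b is defined when the flag is off.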