Re: Unicode script support in Regex and Character class

Xueming Shen Sun, 25 Apr 2010 22:33:49 -0700

Can I assume we are all OK with at least the API part of the latest webrev/blenderrev of the script support in j.l.Character and j.u.r.Pattern, including j.l.Character.getName()?


http://cr.openjdk.java.net/~sherman/script/blenderrev.html
http://cr.openjdk.java.net/~sherman/script/webrev

Okutsu-san, Yuka, can one of you help review the corresponding CCC at
http://ccc.sfbay.sun.com/6945564?

This is for the j.l.Character part only. I'm still trying to figure out how to take over ownership of 4860714 in the CCC system; we have had a placeholder for this one in CCC since 2003.

Thanks,
-Sherman




Xueming Shen wrote:

Martin Buchholz wrote:
Providing script support is obvious and non-controversial,
because other regex programming environments provide it.
Check that the behavior and syntax of the extension is
consistent with e.g. ICU, python, and especially perl
(5.12 just released!)

http://perldoc.perl.org/perlunicode.html
\p{propName=propValue} is the Unicode "compound form", which is supported in perl 5.12. It also has a variant form, \p{propName:propValue}. It was in my proposal, but I removed it at the last minute. Two forms (\p{In/IsProp} and \p{propName=propValue}) should be good enough for now; three is a little too much. We can always add it later, if desirable.

\p{IsScript}, \p{Isgc}, \p{InBlock} are perl compatible as well.
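
For readers trying the two forms side by side, here is a minimal sketch. It assumes the syntax as it later shipped in java.util.regex.Pattern (Java 7 and later), which may differ in detail from the webrev under review; the class and helper names are made up for illustration.

```java
import java.util.regex.Pattern;

class ScriptSyntaxSketch {
    // Hypothetical helper: true if the whole input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String greek = "\u03b1\u03b2\u03b3"; // alpha, beta, gamma

        // Script: keyword form and Is-prefixed form.
        System.out.println(matches("\\p{script=Greek}+", greek));
        System.out.println(matches("\\p{IsGreek}+", greek));

        // Block: keyword form and In-prefixed form.
        System.out.println(matches("\\p{block=Greek}+", greek));
        System.out.println(matches("\\p{InGreek}+", greek));

        // General category: keyword form (Ll = lowercase letter).
        System.out.println(matches("\\p{general_category=Ll}+", greek));

        // Negation with \P: none of these characters are Latin.
        System.out.println(matches("\\P{script=Latin}+", greek));
    }
}
```

Each call above is expected to print true on a JDK that supports the script constructs.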
I would add some documentation to the three special script values;
their meaning is not obvious.
I think it might be better to just leave the detailed explanation to TR#24. The "script" here in j.l.Character serves only as an id; the API doc should not be the place to explain "what they really are".
For implementation, the character matching problem is in general
equivalent to the problem of compiling a switch statement, which is
known to be non-trivial.  Guava contains a CharMatcher class that
tries to solve related problems.
http://guava-libraries.googlecode.com/svn/trunk/javadoc/com/google/common/base/CharMatcher.html
I'm thinking scripts and blocks should know about which ranges they contain.
In particular, \p{BlockName} should not need binary search at
regex compile time or runtime.
It definitely is desirable if we can avoid the binary-search lookup, at least at runtime. The cost will be keeping a separate/redundant block/script->ranges table in regex itself.
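
To make the trade-off concrete, here is a rough sketch of the two lookup styles. The tables are hypothetical and heavily simplified (a handful of ranges, not real Unicode script data), and none of the names come from the webrev.

```java
import java.util.Arrays;

class RangeLookupSketch {
    // Hypothetical shared table: sorted range starts, and the script id
    // that applies from each start up to the next start.
    static final int[] STARTS =
        {0x0000, 0x0041, 0x005B, 0x0061, 0x007B, 0x0370, 0x0400};
    static final String[] SCRIPTS =
        {"COMMON", "LATIN", "COMMON", "LATIN", "COMMON", "GREEK", "CYRILLIC"};

    // Style 1: binary search in the shared table for every character tested.
    static String scriptOf(int cp) {
        int i = Arrays.binarySearch(STARTS, cp);
        if (i < 0) {
            i = -i - 2; // (insertion point - 1) is the range containing cp
        }
        return SCRIPTS[i];
    }

    // Style 2: a redundant per-script table of inclusive [start, end] pairs,
    // so a matcher for one script only tests that script's own ranges.
    static final int[][] LATIN_RANGES = {{0x0041, 0x005A}, {0x0061, 0x007A}};

    static boolean isLatinDirect(int cp) {
        for (int[] r : LATIN_RANGES) {
            if (cp >= r[0] && cp <= r[1]) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(scriptOf('A'));
        System.out.println(scriptOf(0x03B1));
        System.out.println(isLatinDirect('z'));
        System.out.println(isLatinDirect('{'));
    }
}
```

The second style avoids the search but duplicates range data that j.l.Character already holds, which is the cost mentioned above.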
---
There is one place you need to change
key word => keyword
---
InMongolian => {...@code InMongolian}
---
Good catch, thanks!
I notice current Unicode block support in JDK is not updated to the
latest standard.
E.g. Samaritan is missing.
The Character class has not been updated to the latest Unicode 5.2.0 yet. Yuka has a CCC pending for that. My script data is from 5.2.0.
Martin
On Thu, Apr 22, 2010 at 01:01, Xueming Shen <[email protected]> wrote:
Hi,
Here is the webrev of the proposal to add Unicode script support in regex
and j.l.Character.

http://cr.openjdk.java.net/~sherman/script/webrev

and the corresponding blenderrev

http://cr.openjdk.java.net/~sherman/script/blenderrev.html

Please comment on the APIs before I submit the CCC, especially

(1) to use enum for the j.l.Character.UnicodeScript (compared to the
traditional j.l.c.Subset)
(2) the piggyback method j.l.c.getName() :-)
(3) the syntax for script constructs. In addition to the "normal"
   \p{InScriptName} and \P{InScriptName} for the script support
   I'm also adding
   \p{script=ScriptName} \P{script=ScriptName} for the new script support
   \p{block=BlockName} \P{block=BlockName} for the "existing" block support
   \p{general_category=CategoryName} \P{general_category=CategoryName} for
   the "existing" gc
   Perl recently also started to accept this \p{propName=propValue} Unicode
   style.
   It opens the door for future "expanding", for example \p{name=XYZ} :-)
(4) and of course, the wording.
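
As a rough illustration of points (1) and (2), the sketch below uses the API as it eventually shipped in Java 7 (Character.UnicodeScript.of, Character.UnicodeScript.forName, Character.getName). Whether the webrev under review has exactly these signatures is an assumption; the wrapper class and the scriptOf helper are hypothetical.

```java
class UnicodeScriptSketch {
    // Hypothetical helper: the script of a code point, as the enum constant name.
    static String scriptOf(int codePoint) {
        return Character.UnicodeScript.of(codePoint).name();
    }

    public static void main(String[] args) {
        System.out.println(scriptOf('A'));     // a Latin letter
        System.out.println(scriptOf(0x03B1));  // Greek small letter alpha
        System.out.println(scriptOf('1'));     // digit shared across scripts
        System.out.println(scriptOf(0x0301));  // combining acute accent

        // Look up the enum constant from a script name.
        System.out.println(Character.UnicodeScript.forName("Greek"));

        // The "piggyback" method: the Unicode name of a code point.
        System.out.println(Character.getName('A'));
    }
}
```

Because the script is an enum, it can be used directly in switch statements and EnumSet, which the j.l.c.Subset style does not offer.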

Thanks,
Sherman
