Can I assume we are all OK with at least the API part of the latest webrev/blenderrev of the script support in j.l.Character and j.u.r.Pattern, including j.l.Character.getName()?

http://cr.openjdk.java.net/~sherman/script/blenderrev.html
http://cr.openjdk.java.net/~sherman/script/webrev

Okutsu-san, Yuka, can one of you help review the corresponding CCC at
http://ccc.sfbay.sun.com/6945564?

This is for the j.l.Character part only. I'm still trying to figure out how to take over ownership of 4860714 in the CCC system; we have had a placeholder for it in CCC dating back to 2003.

Thanks,
-Sherman




Xueming Shen wrote:
Martin Buchholz wrote:
Providing script support is obvious and non-controversial,
because other regex programming environments provide it.
Check that the behavior and syntax of the extension is
consistent with e.g. ICU, python, and especially perl
(5.12 just released!)

http://perldoc.perl.org/perlunicode.html

\p{propName=propValue} is the Unicode "compound form", which is supported in Perl 5.12. It also has a variant, \p{propName:propValue}. That variant was in my proposal, but I removed it at the last minute. Two forms (\p{In/IsProp} and \p{propName=propValue}) should be good enough for now; three is a little too much. We can always add it later, if desirable.

\p{IsScript}, \p{Isgc}, \p{InBlock} are perl compatible as well.
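As a rough illustration of the two forms discussed above, the following sketch uses the syntax as it appears in the java.util.regex.Pattern documentation of later JDK releases (7 and up). The class and helper names are made up for the example, and the Greek sample text is an arbitrary choice.

```java
import java.util.regex.Pattern;

public class ScriptSyntaxDemo {
    // Hypothetical helper: true if the whole input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String greek = "\u03B1\u03B2\u03B3"; // Greek small letters alpha, beta, gamma

        // Script: prefix form and compound form.
        System.out.println(matches("\\p{IsGreek}+", greek));
        System.out.println(matches("\\p{script=Greek}+", greek));

        // Block: prefix form and compound form.
        System.out.println(matches("\\p{InGreek}+", greek));
        System.out.println(matches("\\p{block=Greek}+", greek));

        // General category: prefix form and compound form.
        System.out.println(matches("\\p{IsLu}", "A"));
        System.out.println(matches("\\p{general_category=Lu}", "A"));

        // Upper-case P negates the property.
        System.out.println(matches("\\P{script=Greek}+", "abc"));
    }
}
```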

I would add some documentation to the three special script values;
their meaning is not obvious.

I think it might be better to just leave the detailed explanation to TR#24. The "script" here in j.l.Character serves only as an id; the API should not be the place to explain "what they really are".
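For readers wondering what the special script values look like in practice, here is a small sketch against the Character.UnicodeScript API as released in JDK 7 and later, where the three special constants are named COMMON, INHERITED and UNKNOWN. The class name, helper and sample code points are assumptions chosen for illustration.

```java
public class SpecialScriptsDemo {
    // Hypothetical helper wrapping the released API call.
    static Character.UnicodeScript scriptOf(int codePoint) {
        return Character.UnicodeScript.of(codePoint);
    }

    public static void main(String[] args) {
        // An ordinary letter belongs to a concrete script.
        System.out.println(scriptOf('A'));     // LATIN
        // Digits and punctuation are shared across scripts.
        System.out.println(scriptOf('1'));     // COMMON
        // A combining mark takes its script from the base character.
        System.out.println(scriptOf(0x0301));  // INHERITED
        // UNKNOWN is reserved for code points with no script assignment;
        // which code points those are depends on the Unicode version in use.
    }
}
```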

For implementation, the character matching problem is in general
equivalent to the problem of compiling a switch statement, which is
known to be non-trivial.  Guava contains a CharMatcher class that
tries to solve related problems.

http://guava-libraries.googlecode.com/svn/trunk/javadoc/com/google/common/base/CharMatcher.html

I'm thinking scripts and blocks should know about which ranges they contain.
In particular, \p{BlockName} should not need binary search at
regex compile time or runtime.
It is definitely desirable to avoid the binary-search lookup, at least at runtime. The cost would be keeping a separate/redundant block/script->ranges table in regex itself.
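To make the trade-off concrete, here is a minimal sketch (not the webrev's implementation) contrasting a direct range check with a lookup through Character.UnicodeBlock.of. The class name and constants are hypothetical; the range U+0370..U+03FF is the "Greek and Coptic" block.

```java
public class BlockRangeDemo {
    // Hypothetical precomputed range for one block; a real table would
    // hold one such entry (or several ranges) per block or script.
    static final int GREEK_START = 0x0370;
    static final int GREEK_END = 0x03FF;

    // Direct comparison against the known range: no table search at match time.
    static boolean inGreekByRange(int codePoint) {
        return codePoint >= GREEK_START && codePoint <= GREEK_END;
    }

    // Goes through the general block lookup, then compares the result.
    static boolean inGreekByLookup(int codePoint) {
        return Character.UnicodeBlock.of(codePoint) == Character.UnicodeBlock.GREEK;
    }

    public static void main(String[] args) {
        int[] samples = {0x036F, 0x0370, 0x03B1, 0x03FF, 0x0400};
        for (int cp : samples) {
            System.out.printf("U+%04X range=%b lookup=%b%n",
                    cp, inGreekByRange(cp), inGreekByLookup(cp));
        }
    }
}
```

The redundancy mentioned above shows up here as the duplicated range constants, which would have to be kept in sync with the data behind Character.UnicodeBlock.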

---
There is one place you need to change
key word => keyword
---
InMongolian => {...@code InMongolian}
---

Good catch, thanks!

I notice current Unicode block support in JDK is not updated to the
latest standard.
E.g. Samaritan is missing.

The Character class has not been updated to the latest Unicode 5.2.0 yet. Yuka has a CCC pending for that. My script data is from 5.2.0.


Martin

On Thu, Apr 22, 2010 at 01:01, Xueming Shen <xueming.s...@oracle.com> wrote:
Hi,

Here is the webrev of the proposal to add Unicode script support in regex
and j.l.Character.

http://cr.openjdk.java.net/~sherman/script/webrev

and the corresponding blenderrev

http://cr.openjdk.java.net/~sherman/script/blenderrev.html

Please comment on the APIs before I submit the CCC, especially

(1) to use enum for the j.l.Character.UnicodeScript (compared to the
traditional j.l.c.Subset)
(2) the piggyback method j.l.c.getName() :-)
(3) the syntax for script constructs. In addition to the "normal"
   \p{InScriptName} and \P{InScriptName} for the script support
   I'm also adding
   \p{script=ScriptName} \P{script=ScriptName} for the new script support
   \p{block=BlockName} \P{block=BlockName} for the "existing" block support
   \p{general_category=CategoryName} \P{general_category=CategoryName} for the "existing" gc
Perl recently also started to accept this \p{propName=propValue} Unicode
style.
It opens the door for future "expanding", for example \p{name=XYZ} :-)
(4) and of course, the wording.
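For items (1) and (2), a short sketch of how the enum and the name lookup read in client code, written against the API as it shipped in JDK 7 and later (Character.UnicodeScript.of, Character.UnicodeScript.forName and Character.getName). The class name and the describe helper are hypothetical.

```java
public class NameAndScriptDemo {
    // Hypothetical helper: combines the character name with its script.
    static String describe(int codePoint) {
        return Character.getName(codePoint)
                + " [" + Character.UnicodeScript.of(codePoint) + "]";
    }

    public static void main(String[] args) {
        System.out.println(describe('A'));

        // Being an enum, UnicodeScript values can be compared with == and
        // used in switch statements; forName resolves a script name.
        Character.UnicodeScript latin = Character.UnicodeScript.forName("Latin");
        System.out.println(latin == Character.UnicodeScript.of('z'));
    }
}
```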

Thanks,
Sherman





