Martin Buchholz wrote:
Providing script support is obvious and non-controversial,
because other regex programming environments provide it.
Check that the behavior and syntax of the extension are
consistent with e.g. ICU, Python, and especially Perl
(5.12 just released!)
http://perldoc.perl.org/perlunicode.html
\p{propName=propValue} is the Unicode "compound form", which is supported in
Perl 5.12. It also has a variant, \p{propName:propValue}. That variant was in
my proposal, but I removed it at the last minute. Two forms (\p{In/IsProp} and
\p{propName=propValue}) should be good enough for now; three is a little too
much. We can always add it later, if desirable.
\p{IsScript}, \p{Isgc}, \p{InBlock} are Perl-compatible as well.
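For anyone who wants to try the two forms side by side, here is a minimal sketch. It assumes the syntax as it later shipped in java.util.regex.Pattern (Java 7 and later), which may differ in detail from the webrev under review. The class name PropertySyntaxDemo and the helper matches() are made up for illustration.

```java
import java.util.regex.Pattern;

class PropertySyntaxDemo {
    // Returns true if the whole input matches the given regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String greek = "\u03B1\u03B2\u03B3"; // alpha, beta, gamma

        // Script: Perl-style prefix form, then the Unicode compound form.
        System.out.println(matches("\\p{IsGreek}+", greek));
        System.out.println(matches("\\p{script=Greek}+", greek));

        // Block: prefix form, then compound form.
        System.out.println(matches("\\p{InGreek}+", greek));
        System.out.println(matches("\\p{block=Greek}+", greek));

        // General category (Ll = lowercase letter): prefix form, then compound form.
        System.out.println(matches("\\p{IsLl}+", greek));
        System.out.println(matches("\\p{general_category=Ll}+", greek));

        // \P negates the property: Latin letters are not in the Greek script.
        System.out.println(matches("\\P{script=Greek}+", "abc"));
    }
}
```

Each pair is expected to behave identically; the compound form just names the property explicitly instead of relying on the Is/In prefix.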
I would add some documentation to the three special script values;
their meaning is not obvious.
I think it might be better to just leave the detailed explanation to
TR#24. The "script" here in j.l.Character serves only as an identifier;
the API should not be the place to explain "what they really are".
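As a rough illustration of what the three special values look like in practice, the sketch below assumes the enum as it later shipped in Java 7 (Character.UnicodeScript with COMMON, INHERITED and UNKNOWN); the class name and the scriptOf() helper are hypothetical. The chosen code points are a digit shared by many scripts, a combining accent that takes on the script of its base character, and a noncharacter that is never assigned.

```java
class SpecialScriptsDemo {
    // Thin wrapper so the lookup reads clearly at the call site.
    static Character.UnicodeScript scriptOf(int codePoint) {
        return Character.UnicodeScript.of(codePoint);
    }

    public static void main(String[] args) {
        System.out.println(scriptOf('1'));     // digit used across many scripts
        System.out.println(scriptOf(0x0301));  // COMBINING ACUTE ACCENT
        System.out.println(scriptOf(0xFFFF));  // noncharacter, never assigned
    }
}
```

On a Java 7 or later runtime this should report COMMON, INHERITED and UNKNOWN respectively; TR#24 remains the place for the precise definitions.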
For implementation, the character matching problem is in general
equivalent to the problem of compiling a switch statement, which is
known to be non-trivial. Guava contains a CharMatcher class that
tries to solve related problems.
http://guava-libraries.googlecode.com/svn/trunk/javadoc/com/google/common/base/CharMatcher.html
I'm thinking scripts and blocks should know about which ranges they contain.
In particular, \p{BlockName} should not need binary search at
regex compile time or runtime.
It is definitely desirable if we can avoid the binary-search lookup,
at least at runtime. The cost will be keeping a separate/redundant
block/script->ranges table in regex itself.
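To make the trade-off concrete, here is a minimal sketch of such a redundant table for blocks. It is not the webrev's implementation: BlockRangeTable, RangeMatcher and compile() are hypothetical names, and the table holds only three blocks for illustration. The name is resolved once at pattern-compile time, and each runtime match is then two integer comparisons.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class BlockRangeTable {
    // Hypothetical redundant table: block name -> {first, last} code point.
    private static final Map<String, int[]> RANGES = new HashMap<>();
    static {
        RANGES.put("GREEK", new int[] {0x0370, 0x03FF});
        RANGES.put("CYRILLIC", new int[] {0x0400, 0x04FF});
        RANGES.put("MONGOLIAN", new int[] {0x1800, 0x18AF});
    }

    // Matcher that carries its range, so matching needs no table lookup.
    static final class RangeMatcher {
        final int first;
        final int last;

        RangeMatcher(int first, int last) {
            this.first = first;
            this.last = last;
        }

        boolean matches(int codePoint) {
            return codePoint >= first && codePoint <= last;
        }
    }

    // "Regex compile time": one hash lookup by (case-insensitive) block name.
    static RangeMatcher compile(String blockName) {
        int[] range = RANGES.get(blockName.toUpperCase(Locale.ROOT));
        if (range == null) {
            throw new IllegalArgumentException("Unknown block: " + blockName);
        }
        return new RangeMatcher(range[0], range[1]);
    }

    public static void main(String[] args) {
        RangeMatcher mongolian = compile("Mongolian");
        System.out.println(mongolian.matches(0x1820)); // MONGOLIAN LETTER A
        System.out.println(mongolian.matches('A'));
    }
}
```

A block is a single contiguous range, so one pair of bounds is enough. A script is generally a set of disjoint ranges, so a script matcher would need a list of such pairs, which is where the switch-compilation problem mentioned above comes back in.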
---
There is one place you need to change
key word => keyword
---
InMongolian => {@code InMongolian}
---
Good catch, thanks!
I notice current Unicode block support in JDK is not updated to the
latest standard.
E.g. Samaritan is missing.
The Character class has not been updated to the latest Unicode 5.2.0 yet.
Yuka has a CCC pending for that. My script data is from 5.2.0.
Martin
On Thu, Apr 22, 2010 at 01:01, Xueming Shen <xueming.s...@oracle.com> wrote:
Hi,
Here is the webrev of the proposal to add Unicode script support in regex
and j.l.Character.
http://cr.openjdk.java.net/~sherman/script/webrev
and the corresponding blenderrev
http://cr.openjdk.java.net/~sherman/script/blenderrev.html
Please comment on the APIs before I submit the CCC, especially
(1) to use enum for the j.l.Character.UnicodeScript (compared to the
traditional j.l.c.Subset)
(2) the piggyback method j.l.c.getName() :-)
(3) the syntax for script constructs. In addition to the "normal"
\p{InScriptName} and \P{InScriptName} for the script support,
I'm also adding
  \p{script=ScriptName} \P{script=ScriptName} for the new script support
  \p{block=BlockName} \P{block=BlockName} for the "existing" block support
  \p{general_category=CategoryName} \P{general_category=CategoryName} for the "existing" gc
Perl recently also started to accept this \p{propName=propValue} Unicode style.
It opens the door for future "expanding", for example \p{name=XYZ} :-)
(4) and of course, the wording.
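For reviewers weighing points (1) and (2), a small sketch of how the API reads from client code. It assumes the names as they later shipped in Java 7 (Character.UnicodeScript.of, Character.UnicodeScript.forName, Character.getName) and uses Map.merge from Java 8; the class ScriptCountDemo and its countByScript() method are made up for illustration.

```java
import java.util.EnumMap;
import java.util.Map;

class ScriptCountDemo {
    // Counts code points per script. EnumMap is usable here only because
    // UnicodeScript is an enum rather than a Character.Subset.
    static Map<Character.UnicodeScript, Integer> countByScript(String s) {
        Map<Character.UnicodeScript, Integer> counts =
                new EnumMap<>(Character.UnicodeScript.class);
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            counts.merge(Character.UnicodeScript.of(cp), 1, Integer::sum);
            i += Character.charCount(cp); // step over surrogate pairs correctly
        }
        return counts;
    }

    public static void main(String[] args) {
        // Three Latin letters followed by Greek alpha and beta.
        System.out.println(countByScript("abc\u03B1\u03B2"));
        // Look up a script constant by name.
        System.out.println(Character.UnicodeScript.forName("Greek"));
        // The "piggyback" method: the Unicode name of a code point.
        System.out.println(Character.getName('A'));
    }
}
```

The enum gives switch statements, EnumMap/EnumSet and valueOf for free, which is the main practical difference from the Subset approach.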
Thanks,
Sherman