Providing script support is obvious and non-controversial, because other regex programming environments already provide it. Check that the behavior and syntax of the extension are consistent with e.g. ICU, Python, and especially Perl (5.12 just released!)
http://perldoc.perl.org/perlunicode.html

I would add some documentation to the three special script values; their meaning is not obvious.

For implementation, the character matching problem is in general equivalent to the problem of compiling a switch statement, which is known to be non-trivial. Guava contains a CharMatcher class that tries to solve related problems.
http://guava-libraries.googlecode.com/svn/trunk/javadoc/com/google/common/base/CharMatcher.html

I'm thinking scripts and blocks should know about which ranges they contain. In particular, \p{BlockName} should not need binary search at regex compile time or runtime.

---
There is one place you need to change
key word => keyword
---
InMongolian => {...@code InMongolian}
---
I notice current Unicode block support in JDK is not updated to the latest standard. E.g. Samaritan is missing.

Martin

On Thu, Apr 22, 2010 at 01:01, Xueming Shen <xueming.s...@oracle.com> wrote:
> Hi,
>
> Here is the webrev of the proposal to add Unicode script support in regex
> and j.l.Character.
>
> http://cr.openjdk.java.net/~sherman/script/webrev
>
> and the corresponding blenderrev
>
> http://cr.openjdk.java.net/~sherman/script/blenderrev.html
>
> Please comment on the APIs before I submit the CCC, especially
>
> (1) to use enum for the j.l.Character.UnicodeScript (compared to the
> traditional j.l.c.Subset)
> (2) the piggyback method j.l.c.getName() :-)
> (3) the syntax for script constructs. In addition to the "normal"
> \p{InScriptName} and \P{InScriptName} for the script support
> I'm also adding
> \p{script=ScriptName} \P{script=ScriptName} for the new script support
> \p{block=BlockName} \P{block=BlockName} for the "existing" block support
> \p{general_category=CategoryName} \P{general_category=CategoryName} for
> the "existing" gc
> Perl recently also started to accept this \p{propName=propValue} Unicode
> style.
> It opens the door for future "expanding", for example \p{name=XYZ} :-)
> (4)and of course, the wording.
>
> Thanks,
> Sherman
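
For reference, the \p{propName=propValue} forms listed in item (3) of the quoted proposal can be exercised roughly as follows. This is only a sketch: it assumes a JDK whose java.util.regex.Pattern accepts the script=, block= and general_category= keywords (Java 7 and later do). The class name ScriptSyntaxDemo and the helper matches() are made up for illustration, and U+1820 (MONGOLIAN LETTER A) is just a convenient test character.

```java
import java.util.regex.Pattern;

public class ScriptSyntaxDemo {
    /** Hypothetical helper: true if the whole input matches the regex. */
    public static boolean matches(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        String mongolianA = "\u1820"; // MONGOLIAN LETTER A

        // Script, using the propName=propValue form.
        System.out.println(matches("\\p{script=Mongolian}", mongolianA));
        // Negated form: a Latin letter is outside the Mongolian script.
        System.out.println(matches("\\P{script=Mongolian}", "a"));

        // Block, in the existing In-prefixed form and the keyword form.
        System.out.println(matches("\\p{InMongolian}", mongolianA));
        System.out.println(matches("\\p{block=Mongolian}", mongolianA));

        // General category: Lo is "other letter".
        System.out.println(matches("\\p{general_category=Lo}", mongolianA));

        // The enum from item (1), looked up by code point.
        System.out.println(Character.UnicodeScript.of(mongolianA.codePointAt(0)));
    }
}
```

Keeping the block and category forms next to the script form makes it easy to compare against what Perl and ICU accept for the same property names.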
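
To make the "blocks should know which ranges they contain" suggestion above concrete, here is a minimal sketch. It is not how the JDK implements blocks; BlockRangeSketch, BlockRange and its fields are hypothetical names. The two ranges are the ones listed for those blocks in the Unicode Blocks.txt data file.

```java
public class BlockRangeSketch {
    /** Hypothetical stand-in for a Unicode block that carries its own range. */
    public static final class BlockRange {
        final String name;
        final int start; // first code point, inclusive
        final int end;   // last code point, inclusive

        BlockRange(String name, int start, int end) {
            this.name = name;
            this.start = start;
            this.end = end;
        }

        /** Membership is two comparisons; no table search is involved. */
        public boolean contains(int codePoint) {
            return codePoint >= start && codePoint <= end;
        }
    }

    // Assumed illustration data: ranges as listed in Unicode Blocks.txt.
    public static final BlockRange MONGOLIAN = new BlockRange("Mongolian", 0x1800, 0x18AF);
    public static final BlockRange SAMARITAN = new BlockRange("Samaritan", 0x0800, 0x083F);

    public static void main(String[] args) {
        System.out.println(MONGOLIAN.contains(0x1820)); // inside the Mongolian block
        System.out.println(MONGOLIAN.contains('a'));    // Basic Latin, outside it
        System.out.println(SAMARITAN.contains(0x0800)); // first Samaritan code point
    }
}
```

With the range stored on the block itself, a \p{block=...} node only needs to resolve the name once when the pattern is compiled, and each match attempt is a constant-time range check. Scripts are harder, since a script generally covers many disjoint ranges.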