Re: Unicode script support in Regex and Character class

Xueming Shen Sat, 08 May 2010 14:52:00 -0700

Hi,

The API proposals for Unicode script support below have been approved.


6945564: Unicode script support in Character class
6948903: Make Unicode scripts available for use in regular expressions

Here is the final webrev ready for push.

http://cr.openjdk.java.net/~sherman/6945564_6948903/webrev

(1) It has been suggested that access to the range data of UnicodeScript
and UnicodeBlock might be desirable in certain scenarios. For example, our
regex engine might benefit from such access to avoid a runtime binary
search on every matching operation. I'm considering adding a pair of
methods, UnicodeScript.is(codePoint) and UnicodeBlock.is(codePoint), to
address this issue, but would prefer to handle it in a separate RFE. (It
seems like a no-brainer for UnicodeBlock, but tricky for UnicodeScript,
given that many scripts are spread over a wide set of ranges. Any
suggestions or alternatives?)
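
To make the per-script test in (1) concrete, here is a minimal sketch; it
is not taken from the webrev. It assumes the API shape found in JDK 7 and
later, namely Character.UnicodeScript.of(int) and the \p{IsGreek} regex
property. The ScriptCheck class and its isScript helper are hypothetical
and only illustrate what an is(codePoint)-style method would answer; here
the answer still goes through the general lookup on every call.

```java
import java.util.regex.Pattern;

public class ScriptCheck {
    // Hypothetical helper: per-script membership test layered on the
    // general code point -> script lookup.
    static boolean isScript(Character.UnicodeScript script, int codePoint) {
        return Character.UnicodeScript.of(codePoint) == script;
    }

    // The same question asked through the regex engine.
    static final Pattern GREEK = Pattern.compile("\\p{IsGreek}+");

    static boolean allGreek(String s) {
        return GREEK.matcher(s).matches();
    }

    public static void main(String[] args) {
        // U+03B1 is GREEK SMALL LETTER ALPHA.
        System.out.println(isScript(Character.UnicodeScript.GREEK, 0x03B1));
        System.out.println(allGreek("\u03b1\u03b2\u03b3"));
        System.out.println(allGreek("abc"));
    }
}
```

A dedicated range-based method could skip the general lookup for a single
known script, which is the benefit the regex engine would be after.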

(2) Test results suggest there is not much runtime benefit in keeping a
huge string data pool plus an access hashmap for the getName()
implementation. The latest implementation now takes Ulf's suggestion to
keep a relatively small byte[] pool and generate the names at runtime.
(There is an "even smaller" implementation, which consumes about 300K of
memory at runtime:

http://cr.openjdk.java.net/~sherman/script/webrev.00/

but it has a "scalability" problem that needs to be addressed when the
string pool grows beyond 64k, and it is a little slow.)
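
The general idea in (2), keeping names in a compact byte[] and creating
String objects only on demand, can be sketched as follows. This is an
illustration under stated assumptions, not the layout used in the webrev:
the NamePool class, its offset table, and the ASCII-only restriction are
all hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class NamePool {
    private final byte[] pool;    // all names, concatenated as ASCII bytes
    private final int[] offsets;  // name i occupies offsets[i] .. offsets[i + 1]

    // Assumes every name is pure ASCII, so one char maps to one byte.
    NamePool(String... names) {
        offsets = new int[names.length + 1];
        int total = 0;
        for (int i = 0; i < names.length; i++) {
            offsets[i] = total;
            total += names[i].length();
        }
        offsets[names.length] = total;
        pool = new byte[total];
        for (int i = 0; i < names.length; i++) {
            byte[] b = names[i].getBytes(StandardCharsets.US_ASCII);
            System.arraycopy(b, 0, pool, offsets[i], b.length);
        }
    }

    // Materializes the String at call time instead of holding it in memory.
    String name(int index) {
        int start = offsets[index];
        int length = offsets[index + 1] - start;
        return new String(pool, start, length, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        NamePool p = new NamePool("LATIN", "GREEK", "CYRILLIC");
        System.out.println(p.name(1));
    }
}
```

The trade-off is the one described above: each lookup allocates a new
String, in exchange for not keeping every name and a hashmap resident.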

(3) The UnicodeScript implementation is built on the Unicode 5.2
Scripts.txt. The rest of the Character class, however, is still using the
previous version, waiting for Yuka's Unicode 5.2 RFE to get back in.

(4) The previous webrev can be found at http://cr.openjdk.java.net/~sherman/scripte


Please help review.

Thanks,
-Sherman
