On Wed, May 13, 2015 at 05:16:10PM +0900, NOKUBI Takatsugu wrote:
> On Wed, 13 May 2015 09:12:14 +0200
> Daniel Naber <daniel.na...@languagetool.org> wrote:
>
> > <token regexp="yes">[\u3040-\u309F]+</token>
> > <token>X</token>
>
> There are some exceptions, such as the long-consonant character (っ), but
> it is not a problem to ignore them, because some slang breaks the rule
> anyway.

Ah, I did not know that the Java regex library can match Unicode code points.
Judging from the library documentation at

http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html

and the script assignments in

http://unicode.org/Public/UNIDATA/Scripts.txt

the following, which uses a Unicode script property instead of an explicit
range, should work as well:

<token regexp="yes">\p{IsHiragana}+</token>
<token>ー</token>

As far as I can see, this still includes っ, just like the explicit range, so
that exception would have to be handled separately (or ignored, as you
suggest).

Cheers,
Silvan
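
PS: In case it is useful, here is a minimal sketch for checking which characters each pattern accepts, assuming Java 7 or later. It compares the explicit range with the script properties \p{IsHiragana} and \p{IsHan}. The class name HiraganaScriptDemo and the helper matchesAll are made up for illustration and are not part of LanguageTool; the test strings are written as Unicode escapes only to avoid source-encoding problems.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative, not LanguageTool API.
public class HiraganaScriptDemo {
    // Explicit code point range covering the Hiragana block.
    static final Pattern RANGE = Pattern.compile("[\\u3040-\\u309F]+");
    // Unicode script properties: "Is" prefix followed by the script name.
    static final Pattern HIRAGANA = Pattern.compile("\\p{IsHiragana}+");
    static final Pattern HAN = Pattern.compile("\\p{IsHan}+");

    // True if the whole string is accepted by the pattern.
    static boolean matchesAll(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        String hiraganaWord = "\u3072\u3089\u304C\u306A"; // the word "hiragana" in hiragana
        String smallTsu = "\u3063";                       // small tsu (long consonant marker)
        String longVowel = "\u30FC";                      // prolonged sound mark
        String kanjiWord = "\u6F22\u5B57";                // the word "kanji" in kanji

        System.out.println("range    / hiragana:  " + matchesAll(RANGE, hiraganaWord));
        System.out.println("hiragana / hiragana:  " + matchesAll(HIRAGANA, hiraganaWord));
        System.out.println("han      / hiragana:  " + matchesAll(HAN, hiraganaWord));
        System.out.println("han      / kanji:     " + matchesAll(HAN, kanjiWord));
        System.out.println("hiragana / small tsu: " + matchesAll(HIRAGANA, smallTsu));
        System.out.println("hiragana / long mark: " + matchesAll(HIRAGANA, longVowel));
    }
}
```

I would expect the range and \p{IsHiragana} to accept both the hiragana word and the small tsu, \p{IsHan} to accept only the kanji word, and none of the three to accept the prolonged sound mark, since Scripts.txt lists it under Common.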