On Wed, May 13, 2015 at 05:16:10PM +0900, NOKUBI Takatsugu wrote:
> On Wed, 13 May 2015 09:12:14 +0200
> Daniel Naber <daniel.na...@languagetool.org> wrote:
> 
> > <token regexp="yes">[\u3040-\u309F]+</token>
> > <token>X</token>
> 
> There are some exceptions, like the long consonant character (っ), but it is
> fine to ignore them because some slang breaks the rule anyway.

Ah, I did not know that the Java regex library can match Unicode
code points. Judging by the documentation of the library here

http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html

and

http://unicode.org/Public/UNIDATA/Scripts.txt

the following, using a Unicode script property, should work as well.

<token regexp="yes">\p{IsHan}+</token>
<token>ー</token>

Presumably that does not match っ, since っ (U+3063) is a Hiragana
letter rather than a Han character.
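
A quick way to check what each pattern accepts is a small standalone
Java program. This is only a sketch: the class and method names are
made up for illustration and are not part of LanguageTool. It compares
the script-based pattern \p{IsHan}+ with the code-point range
[\u3040-\u309F]+ from the quoted rule. Note that the quoted range is
the Hiragana block, which does contain っ (U+3063), whereas the Han
script does not.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class ScriptMatchDemo {
    // Script property: matches one or more Han (kanji) characters.
    static final Pattern HAN = Pattern.compile("\\p{IsHan}+");
    // Code-point range from the quoted rule: the Hiragana block.
    static final Pattern HIRAGANA_BLOCK = Pattern.compile("[\\u3040-\\u309F]+");

    // True if the whole string consists of Han characters.
    static boolean isHan(String s) {
        return HAN.matcher(s).matches();
    }

    // True if the whole string lies in the Hiragana block.
    static boolean isHiraganaBlock(String s) {
        return HIRAGANA_BLOCK.matcher(s).matches();
    }

    public static void main(String[] args) {
        String kanji = "\u6F22\u5B57"; // U+6F22 U+5B57, the word "kanji"
        String sokuon = "\u3063";      // U+3063, HIRAGANA LETTER SMALL TU

        System.out.println("IsHan matches kanji:        " + isHan(kanji));
        System.out.println("IsHan matches small tsu:    " + isHan(sokuon));
        System.out.println("Range matches small tsu:    " + isHiraganaBlock(sokuon));
        System.out.println("Range matches kanji:        " + isHiraganaBlock(kanji));
    }
}
```

If the rule is meant to cover hiragana rather than kanji, the
script-based counterpart of the quoted range would be
\p{IsHiragana}, which the Pattern documentation lists alongside IsHan.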


Cheers,

Silvan

