Re: Good way to handle invalid prolonged sound mark

Daniel Naber Wed, 13 May 2015 00:12:50 -0700

On 2015-05-13 07:43, Takatsugu Nokubi wrote:

> "ー" (prolonged sound mark) is a popular symbol in Japanese.
> And the rule itself is simple:
> 
> The symbol is placed after Hiragana or Katakana, not Kanji.


If the scripts (Hiragana, Katakana, Kanji) have non-overlapping Unicode 
ranges, it should be possible to use a regular expression like this:

Hiragana: [\u3040-\u309F]
not Hiragana: [^\u3040-\u309F]

So if you want to find the character "X" after Hiragana you could try this:

<token regexp="yes">.*[\u3040-\u309F]X</token>

Or maybe, depending on tokenization:

<token regexp="yes">[\u3040-\u309F]+</token>
<token>X</token>
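
To try the character-class idea outside LanguageTool, here is a minimal Python
sketch. It is only an illustration under stated assumptions: the Katakana
range (U+30A0-U+30FF) and the helper name find_invalid_marks are added here
and do not come from the original question, and real text may need a finer
check than whole Unicode blocks.

```python
import re

# Hiragana range as given above; the Katakana range is an assumption
# added for this sketch (that block also contains "ー" itself, U+30FC).
HIRAGANA = r"\u3040-\u309F"
KATAKANA = r"\u30A0-\u30FF"

# Match "ー" when the character before it is NOT kana. The negative
# lookbehind also succeeds at the very start of the string.
INVALID_MARK = re.compile(rf"(?<![{HIRAGANA}{KATAKANA}])ー")


def find_invalid_marks(text):
    """Return the index of each "ー" that does not follow Hiragana or Katakana."""
    return [m.start() for m in INVALID_MARK.finditer(text)]


if __name__ == "__main__":
    for sample in ["ラーメン", "あー", "漢ー"]:
        print(sample, find_invalid_marks(sample))
```

Because "ー" itself lies in the assumed Katakana range, a repeated mark such as
"ラーー" is not flagged by this sketch; whether that is desirable depends on the
rule you want.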

Regards
  Daniel


_______________________________________________
Languagetool-devel mailing list
Languagetool-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/languagetool-devel
