SRX Sentence Tokenizer

Andriy Rysin Wed, 01 May 2013 09:18:52 -0700

Hi all

I need a bit of help with the SRX sentence tokenizer. I've added this rule to
prevent a sentence split on an initial followed by a surname, e.g. "Т.Шевченко",
which is common in texts.
The rule will need to be a bit more complex, but I am trying something
simple first.


<rule break="no">
<beforebreak>\b[А-ЯІЇЄҐ]\.[А-ЯІЇЄҐ]</beforebreak>
<afterbreak></afterbreak>
</rule>

But my test in UkrainianSRXSentenceTokenizerTest.java fails (it's currently
commented out in svn):

    testSplit("Наша зустріч з А.Марчуком відбулася в грудні минулого року.");

I tried tweaking the regex a bit, but nothing helps. I've added a couple of
other rules and they worked OK.
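
For reference, here is how I understand a no-break rule to be applied, as a
minimal Java sketch and not the actual tokenizer code. It assumes that
beforebreak has to match the text ending exactly at the candidate break
position, that afterbreak has to match the text starting there, and that the
candidate position is right after the period. The names NoBreakRuleSketch and
coversPosition are made up for illustration.

```java
import java.util.regex.Pattern;

// Illustrative sketch only; not the code the tokenizer actually runs.
class NoBreakRuleSketch {

    // True if a no-break rule covers the candidate break position `pos`:
    // `before` must match text ending exactly at pos, and
    // `after` must match text starting exactly at pos.
    static boolean coversPosition(String text, int pos, String before, String after) {
        int flags = Pattern.UNICODE_CHARACTER_CLASS; // make \b aware of Cyrillic letters
        Pattern beforePattern = Pattern.compile("(?:" + before + ")\\z", flags);
        Pattern afterPattern = Pattern.compile(after, flags);
        boolean beforeOk = beforePattern.matcher(text.substring(0, pos)).find();
        boolean afterOk = afterPattern.matcher(text.substring(pos)).lookingAt();
        return beforeOk && afterOk;
    }

    public static void main(String[] args) {
        String text = "Наша зустріч з А.Марчуком відбулася в грудні минулого року.";
        int pos = text.indexOf("А.") + 2; // assumed candidate: right after the period

        // Rule as posted: the surname letter is inside beforebreak.
        System.out.println(coversPosition(text, pos, "\\b[А-ЯІЇЄҐ]\\.[А-ЯІЇЄҐ]", ""));
        // Hypothetical variant: the surname letter moved to afterbreak.
        System.out.println(coversPosition(text, pos, "\\b[А-ЯІЇЄҐ]\\.", "[А-ЯІЇЄҐ]"));
    }
}
```

Under these assumptions the rule as posted does not cover the position after
the period, while the variant with the surname letter in afterbreak does. I
have not confirmed that this is what makes the test fail.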

Any help would be greatly appreciated.

Thanks
Andriy

P.S. BTW, wouldn't it make sense to split segment.srx by language module?


_______________________________________________
Languagetool-devel mailing list
Languagetool-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/languagetool-devel
