[ https://issues.apache.org/jira/browse/OPENNLP-203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13054318#comment-13054318 ]
Nicolas Hernandez commented on OPENNLP-203:
-------------------------------------------
All right.
I confirm it works now.
I tested with a sample of the europarl-v6 corpus [1], extracted as follows:
cat europarl-v6.fr-en.fr | perl -ne "if (/[\.\?\!\:\;\'\"»…]$/g) { print;} "|
head -n 1000 > europarl-v6.fr-en.fr.1KSent
I ran the Apache whitespace tokenizer and then the OpenNLP UIMA
SentenceDetectorTrainer to build a model, and I tested that model with the
OpenNLP UIMA SentenceDetector.
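For anyone who wants to check a model for the same offset problem, here is a
minimal sketch in Python. It is not part of OpenNLP or UIMA: the function
name, the (begin, end) span format with an exclusive end, and the French
sample text are all assumptions made for illustration. The set of
sentence-final characters is copied from the perl filter above; the narrower
set used for the "starts with punctuation" check is my own guess, chosen so
that a sentence opening with a quote is not flagged.

```python
# Hypothetical helper, not an OpenNLP or UIMA API: flags sentence spans that
# show the offset shift described in OPENNLP-203.

# Sentence-final characters, taken from the perl filter in this message.
END_CHARS = set(".?!:;'\"»…")
# Assumption: characters that should never open a sentence.
BAD_START_CHARS = set(".?!:;…")


def find_shifted_spans(text, spans):
    """Return the indices of spans that look shifted by one character.

    `spans` is a list of (begin, end) character offsets, end exclusive.
    A span is flagged if it starts with a terminator or does not end with one.
    """
    shifted = []
    for i, (begin, end) in enumerate(spans):
        sentence = text[begin:end].strip()
        if not sentence:
            continue  # skip empty or whitespace-only spans
        starts_badly = sentence[0] in BAD_START_CHARS
        ends_properly = sentence[-1] in END_CHARS
        if starts_badly or not ends_properly:
            shifted.append(i)
    return shifted


text = "Bonjour. Merci beaucoup."
good = [(0, 8), (9, 24)]   # each sentence keeps its own final period
bad = [(0, 7), (7, 24)]    # the period leaked into the second sentence

print(find_shifted_spans(text, good))
print(find_shifted_spans(text, bad))
```

With the spans a correct model should produce, nothing is flagged; with the
shifted spans, both sentences are reported. Real annotations would have to be
read from the CAS first, which this sketch does not attempt.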
[1] http://www.statmt.org/europarl/
> UIMA Sentence Detector Trainer builds models which do not split correctly the
> sentences
> ---------------------------------------------------------------------------------------
>
> Key: OPENNLP-203
> URL: https://issues.apache.org/jira/browse/OPENNLP-203
> Project: OpenNLP
> Issue Type: Bug
> Components: Sentence Detector, UIMA Integration
> Affects Versions: tools-1.5.1-incubating
> Environment: OS
> Linux version 2.6.32-30-generic (buildd@vernadsky) (gcc version 4.4.3 (Ubuntu
> 4.4.3-4ubuntu5) ) #59-Ubuntu SMP Tue Mar 1 21:30:21 UTC 2011
> JVM
> java version "1.6.0_17"
> Java(TM) SE Runtime Environment (build 1.6.0_17-b04)
> Java HotSpot(TM) Server VM (build 14.3-b01, mixed mode)
> Reporter: Nicolas Hernandez
> Fix For: tools-1.5.2-incubating
>
>
> The models trained with the UIMA component produce wrong begin/end offsets,
> even though they do manage to split the text into sentences.
> I observed that each sentence begins with the final punctuation character of
> the previous sentence as its first token, while the previous sentence does
> not include that character as its last token.