[ https://issues.apache.org/jira/browse/OPENNLP-203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13053204#comment-13053204 ]
Jörn Kottmann commented on OPENNLP-203:
---------------------------------------
This issue is linked to the "useTokenEnd" option: if it is false, the code
which computes the span makes the off-by-one error described in the issue.
For now I suggest that the UIMA Sentence Detector Trainer use the same default
as the command line version. Besides that, we should fix the issue in the
Sentence Detector ME code.
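
To make the symptom concrete, here is a minimal, self-contained sketch of how
an off-by-one in the span end produces exactly the reported offsets. This is
not the Sentence Detector ME code: the class SpanOffByOneSketch, its methods,
and the includeEosChar flag are hypothetical names used only for illustration,
and the flag is not the "useTokenEnd" option itself.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not taken from OpenNLP.
public class SpanOffByOneSketch {

    /**
     * Builds half-open [begin, end) character spans from the indexes of the
     * end-of-sentence characters. With includeEosChar == false the end stops
     * ON the punctuation instead of one past it, which is the off-by-one.
     */
    static List<int[]> toSpans(String text, int[] eosIndexes, boolean includeEosChar) {
        List<int[]> spans = new ArrayList<>();
        int begin = 0;
        for (int eos : eosIndexes) {
            int end = includeEosChar ? eos + 1 : eos;
            spans.add(new int[] {begin, end});
            // The next sentence starts where this one ended, after any whitespace.
            begin = end;
            while (begin < text.length() && Character.isWhitespace(text.charAt(begin))) {
                begin++;
            }
        }
        return spans;
    }

    /** Returns the text covered by each span. */
    static List<String> cover(String text, List<int[]> spans) {
        List<String> covered = new ArrayList<>();
        for (int[] span : spans) {
            covered.add(text.substring(span[0], span[1]));
        }
        return covered;
    }

    public static void main(String[] args) {
        String text = "First one. Second one.";
        int[] eos = {9, 21}; // indexes of the two '.' characters

        // Intended spans: [First one., Second one.]
        System.out.println(cover(text, toSpans(text, eos, true)));

        // Off-by-one spans: [First one, . Second one]
        System.out.println(cover(text, toSpans(text, eos, false)));
    }
}
```

In the second case the first sentence loses its final punctuation and the
second sentence begins with it, which matches the behaviour described in the
issue.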
> UIMA Sentence Detector Trainer builds models which do not split correctly the
> sentences
> ---------------------------------------------------------------------------------------
>
> Key: OPENNLP-203
> URL: https://issues.apache.org/jira/browse/OPENNLP-203
> Project: OpenNLP
> Issue Type: Bug
> Components: Sentence Detector, UIMA Integration
> Affects Versions: tools-1.5.1-incubating
> Environment: OS
> Linux version 2.6.32-30-generic (buildd@vernadsky) (gcc version 4.4.3 (Ubuntu
> 4.4.3-4ubuntu5) ) #59-Ubuntu SMP Tue Mar 1 21:30:21 UTC 2011
> JVM
> java version "1.6.0_17"
> Java(TM) SE Runtime Environment (build 1.6.0_17-b04)
> Java HotSpot(TM) Server VM (build 14.3-b01, mixed mode)
> Reporter: Nicolas Hernandez
> Fix For: tools-1.5.2-incubating
>
>
> The models trained with the UIMA component give wrong begin/end offsets,
> even though they manage to split the text into sentences.
> I observed that each sentence begins with the punctuation character of the
> previous sentence as its first token, while the previous sentence does not
> include that character as its last token.
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira