[jira] [Commented] (OPENNLP-203) UIMA Sentence Detector Trainer builds models which do not split correctly the sentences

JIRA Fri, 24 Jun 2011 02:29:15 -0700

    [ https://issues.apache.org/jira/browse/OPENNLP-203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13054325#comment-13054325 ]


Jörn Kottmann commented on OPENNLP-203:
---------------------------------------

You might encounter one more issue. The sentence detector labels each potential 
end-of-sentence character as either sentence-end or no-sentence-end. The 
training samples are generated from your input file. That file is expected to 
contain one sentence per line, and the sample generation code assumes that the 
last end-of-sentence character in each line is the true sentence end.
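
The labeling rule described above can be sketched roughly as follows. This is 
a simplified illustration, not the actual OpenNLP sample generation code: the 
class name EosLabelSketch, the method label, and the fixed character set 
".!?" are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

public class EosLabelSketch {

    // Assumed set of candidate end-of-sentence characters (hypothetical).
    static final String EOS_CHARS = ".!?";

    /**
     * Labels every candidate character in one training line.
     * Only the last candidate in the line is treated as a true sentence end.
     * Each entry has the form "offset:label".
     */
    static List<String> label(String line) {
        // Collect the offsets of all candidate end-of-sentence characters.
        List<Integer> candidates = new ArrayList<>();
        for (int i = 0; i < line.length(); i++) {
            if (EOS_CHARS.indexOf(line.charAt(i)) >= 0) {
                candidates.add(i);
            }
        }
        // The last candidate becomes the positive sample, all others negative.
        List<String> samples = new ArrayList<>();
        for (int k = 0; k < candidates.size(); k++) {
            boolean isLast = (k == candidates.size() - 1);
            samples.add(candidates.get(k) + ":"
                    + (isLast ? "sentence-end" : "no-sentence-end"));
        }
        return samples;
    }

    public static void main(String[] args) {
        // The dot after "Mr" is a negative sample, the final dot a positive one.
        System.out.println(label("Mr. Brown arrived."));
    }
}
```

With a well-formed line such as "Mr. Brown arrived." the abbreviation dot is 
labeled no-sentence-end and the final dot sentence-end, which is what the 
trainer needs.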

In your europarl file there are lines which do not end with an end-of-sentence 
character but contain tokens with end-of-sentence characters.
For example:

Dr. Smith said: <- In this sample the dot in "Dr." would be mistaken for a 
sentence end.
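
One way to find such lines before training is to scan the input file for 
lines that contain a candidate character but do not end with one. The check 
below is only a sketch under assumptions: the class name TrainingLineCheck, 
the method isSuspicious, and the character set ".!?" are hypothetical and not 
part of OpenNLP.

```java
public class TrainingLineCheck {

    // Assumed set of candidate end-of-sentence characters (hypothetical).
    static final String EOS_CHARS = ".!?";

    /**
     * Returns true if the line contains a candidate end-of-sentence
     * character but its last non-whitespace character is not one.
     * For such a line, the "last candidate is the sentence end" assumption
     * would turn an inner character into a positive training sample.
     */
    static boolean isSuspicious(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return false;
        }
        char last = trimmed.charAt(trimmed.length() - 1);
        if (EOS_CHARS.indexOf(last) >= 0) {
            return false; // line ends with a candidate, nothing to flag
        }
        // The line does not end with a candidate; flag it if one occurs inside.
        for (int i = 0; i < trimmed.length(); i++) {
            if (EOS_CHARS.indexOf(trimmed.charAt(i)) >= 0) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isSuspicious("Dr. Smith said:"));    // true
        System.out.println(isSuspicious("Mr. Brown arrived.")); // false
    }
}
```

Flagged lines could then be corrected or dropped from the training data; 
which is better depends on the corpus.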

> UIMA Sentence Detector Trainer builds models which do not split correctly the 
> sentences
> ---------------------------------------------------------------------------------------
>
>                 Key: OPENNLP-203
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-203
>             Project: OpenNLP
>          Issue Type: Bug
>          Components: Sentence Detector, UIMA Integration
>    Affects Versions: tools-1.5.1-incubating
>         Environment: OS
> Linux version 2.6.32-30-generic (buildd@vernadsky) (gcc version 4.4.3 (Ubuntu 
> 4.4.3-4ubuntu5) ) #59-Ubuntu SMP Tue Mar 1 21:30:21 UTC 2011
> JVM
> java version "1.6.0_17"
> Java(TM) SE Runtime Environment (build 1.6.0_17-b04)
> Java HotSpot(TM) Server VM (build 14.3-b01, mixed mode)
>            Reporter: Nicolas Hernandez
>             Fix For: tools-1.5.2-incubating
>
>
> The models trained with the UIMA component give wrong begin/end offsets, 
> even though they manage to split the text into sentences. 
> I observed that each sentence begins with the punctuation character of the 
> previous sentence as its first token, while the previous sentence does not 
> include that character as its last token. 

