[jira] [Commented] (OPENNLP-711) SentenceDetectorME::sentPosDetect() with useTokenEnd=false

Joern Kottmann (JIRA) Mon, 20 Oct 2014 15:09:07 -0700

    [ 
https://issues.apache.org/jira/browse/OPENNLP-711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177556#comment-14177556
 ]


Joern Kottmann commented on OPENNLP-711:
----------------------------------------

When useTokenEnd is false, the calculated position should be the start index 
of the next sentence, that is, the index of the first char in the next 
sentence.

The quoted code instead records the index of the eos char as the start index 
of the next sentence. We should apply the proposed fix and add one to cint so 
that the useTokenEnd=false case is handled correctly.
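
To illustrate the off-by-one, here is a minimal sketch. It is not the actual 
SentenceDetectorME code: the class name SentenceStartSketch is made up, and 
getFirstNonWS below is a simplified stand-in that assumes the real helper 
skips whitespace starting at the given position. Since the eos char itself is 
not whitespace, calling it with cint returns cint unchanged.

```java
public class SentenceStartSketch {

    // Assumed behavior of the helper named in the issue (hypothetical
    // re-implementation): return the index of the first non-whitespace
    // char at or after pos, or s.length() if there is none.
    public static int getFirstNonWS(String s, int pos) {
        while (pos < s.length() && Character.isWhitespace(s.charAt(pos))) {
            pos++;
        }
        return pos;
    }

    public static void main(String[] args) {
        String s = "I am hungry.Ich bin Mr. Bean.Ein guter Satz.";
        int cint = s.indexOf('.'); // index of the first eos char

        // Current code: the eos char is not whitespace, so the position
        // stays on the eos char itself.
        int current = getFirstNonWS(s, cint);

        // Proposed fix: start looking one char after the eos char.
        int proposed = getFirstNonWS(s, cint + 1);

        System.out.println(s.substring(current));  // .Ich bin Mr. Bean.Ein guter Satz.
        System.out.println(s.substring(proposed)); // Ich bin Mr. Bean.Ein guter Satz.
    }
}
```

With the fix, the recorded position is the first char of the next sentence 
rather than the eos char of the previous one.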



> SentenceDetectorME::sentPosDetect() with useTokenEnd=false
> ----------------------------------------------------------
>
>                 Key: OPENNLP-711
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-711
>             Project: OpenNLP
>          Issue Type: Bug
>          Components: Sentence Detector
>    Affects Versions: 1.6.0
>            Reporter: Eugen Hanussek
>            Priority: Minor
>             Fix For: 1.6.0
>
>
> I trained the SentenceModel with a German corpus and was surprised by the 
> results for the following input (a mark indicates the expected split):
> {code:xml}
> "I am hungry.Ich bin Mr. Bean.Ein guter Satz."
>              ^                ^
> {code}
> The result was 3 sentences. Good, but the split was not at the eosChar. It 
> was after the token with the eosChar: "I am hungry.Ich" , "bin Mr. Bean.Ein", 
> ...
> After some debugging I found out that I have to set useTokenEnd=false in the 
> SentenceDetectorFactory constructor.
> I then found a *little bug in SentenceDetectorME* where the span is 
> calculated:
> {code:java}
>   public Span[] sentPosDetect(String s) {
> ...
>       if (bestOutcome.equals(SPLIT) && isAcceptableBreak(s, index, cint)) {
>         if (index != cint) {
>           if (useTokenEnd) {
>             positions.add(getFirstNonWS(s, getFirstWS(s,cint + 1)));
>           }
>           else {
>             positions.add(getFirstNonWS(s,cint)); // this should be positions.add(getFirstNonWS(s,cint + 1));
>           }
>           sentProbs.add(probs[model.getIndex(bestOutcome)]);
>         }
>         index = cint + 1;
>       }
> ...
> {code}
> This change only affects models with useTokenEnd=false.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
