[
https://issues.apache.org/jira/browse/OPENNLP-711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177556#comment-14177556
]
Joern Kottmann commented on OPENNLP-711:
----------------------------------------
When useTokenEnd is false, the calculated position should be the start index
of the next sentence, that is, the index of the first char in the next
sentence.
The code in the description passes cint, the index of the eos char, to
getFirstNonWS. Since an eos char is normally not whitespace, that index comes
back unchanged and the eos char itself becomes the start of the next sentence.
We should apply the proposed fix and add one to cint so that the
useTokenEnd=false case is handled correctly.
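To make the off-by-one concrete, here is a minimal standalone sketch. It is
not the OpenNLP code: the class SentenceStartSketch and its method names are
hypothetical, and the whitespace check uses Character.isWhitespace as an
assumption, which may differ from what SentenceDetectorME really uses.

```java
// Hypothetical standalone sketch, not the actual SentenceDetectorME code.
class SentenceStartSketch {

    // Simplified stand-in for getFirstNonWS: skip whitespace starting at pos.
    // Assumption: Character.isWhitespace is close enough for illustration.
    static int firstNonWhitespace(String s, int pos) {
        while (pos < s.length() && Character.isWhitespace(s.charAt(pos))) {
            pos++;
        }
        return pos;
    }

    // Behaviour described in the report: the search starts at the eos char.
    static int startWithoutFix(String s, int cint) {
        return firstNonWhitespace(s, cint);
    }

    // Proposed fix: start the search one char after the eos char.
    static int startWithFix(String s, int cint) {
        return firstNonWhitespace(s, cint + 1);
    }

    public static void main(String[] args) {
        String s = "I am hungry.Ich bin Mr. Bean.Ein guter Satz.";
        int cint = s.indexOf('.'); // index of the first eos char

        int before = startWithoutFix(s, cint);
        int after = startWithFix(s, cint);

        // Without the fix the next sentence starts at the '.' itself.
        System.out.println(before + " -> '" + s.charAt(before) + "'");
        // With the fix it starts at the first char after the '.'.
        System.out.println(after + " -> '" + s.charAt(after) + "'");
    }
}
```

With the fix, the first sentence would end after the eos char and the next
one would begin at "Ich" instead of ".Ich".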
> SentenceDetectorME::sentPosDetect() with useTokenEnd=false
> ----------------------------------------------------------
>
> Key: OPENNLP-711
> URL: https://issues.apache.org/jira/browse/OPENNLP-711
> Project: OpenNLP
> Issue Type: Bug
> Components: Sentence Detector
> Affects Versions: 1.6.0
> Reporter: Eugen Hanussek
> Priority: Minor
> Fix For: 1.6.0
>
>
> I trained the SentenceModel with a German corpus and was surprised by the
> results for the following input (the marks indicate the expected splits):
> {code:xml}
> "I am hungry.Ich bin Mr. Bean.Ein guter Satz."
>             ^                ^
> {code}
> The result was 3 sentences, which is good, but the splits were not at the
> eosChar. They came after the token containing the eosChar:
> "I am hungry.Ich", "bin Mr. Bean.Ein", ...
> After some debugging I found out that I have to set useTokenEnd=false in the
> SentenceDetectorFactory constructor.
> I then found a *little bug in SentenceDetectorME* where the span is
> calculated:
> {code:java}
> public Span[] sentPosDetect(String s) {
>   ...
>   if (bestOutcome.equals(SPLIT) && isAcceptableBreak(s, index, cint)) {
>     if (index != cint) {
>       if (useTokenEnd) {
>         positions.add(getFirstNonWS(s, getFirstWS(s,cint + 1)));
>       }
>       else {
>         // this should be positions.add(getFirstNonWS(s,cint + 1));
>         positions.add(getFirstNonWS(s,cint));
>       }
>       sentProbs.add(probs[model.getIndex(bestOutcome)]);
>     }
>     index = cint + 1;
>   }
>   ...
> {code}
> This change only affects models with useTokenEnd=false.