[
https://issues.apache.org/jira/browse/OPENNLP-711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Eugen Hanussek updated OPENNLP-711:
-----------------------------------
Description:
I trained the SentenceModel on a German corpus and was surprised by the results
for the following input (the marks indicate the expected splits):
{code:xml}
"I am hungry.Ich bin Mr. Bean.Ein guter Satz."
            ^                ^
{code}
The result was 3 sentences, which is correct, but the splits were not at the
eosChar. Each split came after the token containing the eosChar:
"I am hungry.Ich" , "bin Mr. Bean.Ein", ...
After some debugging I found that I have to set useTokenEnd=false in the
SentenceDetectorFactory constructor.
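To make the useTokenEnd=true behaviour concrete, here is a minimal, self-contained sketch. It is not the OpenNLP source: the class TokenEndSketch and its helpers firstWhitespace and firstNonWhitespace are hypothetical stand-ins, assumed to behave like getFirstWS and getFirstNonWS, and the eosChar positions are looked up by hand instead of being predicted by a model.

```java
final class TokenEndSketch {

    // Hypothetical stand-in for getFirstWS: first whitespace index at or after pos.
    static int firstWhitespace(String s, int pos) {
        while (pos < s.length() && !Character.isWhitespace(s.charAt(pos))) {
            pos++;
        }
        return pos;
    }

    // Hypothetical stand-in for getFirstNonWS: first non-whitespace index at or after pos.
    static int firstNonWhitespace(String s, int pos) {
        while (pos < s.length() && Character.isWhitespace(s.charAt(pos))) {
            pos++;
        }
        return pos;
    }

    // Mirrors the useTokenEnd branch quoted in this issue:
    // skip to the end of the token containing the eosChar, then past the whitespace.
    static int splitAtTokenEnd(String s, int eosIndex) {
        return firstNonWhitespace(s, firstWhitespace(s, eosIndex + 1));
    }

    public static void main(String[] args) {
        String s = "I am hungry.Ich bin Mr. Bean.Ein guter Satz.";
        int firstEos = s.indexOf('.');      // the '.' after "hungry"
        int secondEos = s.indexOf(".Ein");  // the '.' after "Bean"

        int split1 = splitAtTokenEnd(s, firstEos);
        int split2 = splitAtTokenEnd(s, secondEos);

        System.out.println(s.substring(0, split1).trim());      // I am hungry.Ich
        System.out.println(s.substring(split1, split2).trim()); // bin Mr. Bean.Ein
    }
}
```

Under these assumptions the sketch reproduces the two unexpected sentences reported above: with useTokenEnd=true the split lands after the whole token that contains the eosChar.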
I then found a *small bug in SentenceDetectorME* at the point where the span is
calculated:
{code:java}
public Span[] sentPosDetect(String s) {
  ...
  if (bestOutcome.equals(SPLIT) && isAcceptableBreak(s, index, cint)) {
    if (index != cint) {
      if (useTokenEnd) {
        positions.add(getFirstNonWS(s, getFirstWS(s, cint + 1)));
      }
      else {
        // current code: positions.add(getFirstNonWS(s, cint));
        // proposed fix: start the search one character after the eosChar
        positions.add(getFirstNonWS(s, cint + 1));
      }
      sentProbs.add(probs[model.getIndex(bestOutcome)]);
    }
    index = cint + 1;
  }
  ...
{code}
This change only affects models with useTokenEnd=false.
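The off-by-one can be illustrated with a small sketch. It is only an illustration under stated assumptions, not the OpenNLP implementation: SplitOffsetSketch, firstNonWhitespace, splitBefore and splitAfter are hypothetical names, and firstNonWhitespace is assumed to behave like getFirstNonWS (skip whitespace starting at the given index).

```java
final class SplitOffsetSketch {

    // Hypothetical stand-in for getFirstNonWS: first non-whitespace index at or after pos.
    static int firstNonWhitespace(String s, int pos) {
        while (pos < s.length() && Character.isWhitespace(s.charAt(pos))) {
            pos++;
        }
        return pos;
    }

    // Current behaviour: the search starts ON the eosChar, which is itself
    // non-whitespace, so the returned position is the eosChar index.
    static int splitBefore(String s, int eosIndex) {
        return firstNonWhitespace(s, eosIndex);
    }

    // Proposed behaviour: the search starts one character AFTER the eosChar.
    static int splitAfter(String s, int eosIndex) {
        return firstNonWhitespace(s, eosIndex + 1);
    }

    public static void main(String[] args) {
        String s = "I am hungry.Ich bin Mr. Bean.Ein guter Satz.";
        int eosIndex = s.indexOf('.'); // the '.' after "hungry"

        System.out.println(s.substring(0, splitBefore(s, eosIndex))); // I am hungry
        System.out.println(s.substring(0, splitAfter(s, eosIndex)));  // I am hungry.
    }
}
```

In this sketch the current call leaves the eosChar at the start of the following sentence, while the proposed call keeps it at the end of the sentence it terminates and also skips any whitespace that follows it.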
was:
I trained the SentenceModel on a German corpus and was surprised by the results
for the following input (the marks indicate the expected splits):
{code:xml}
"I am hungry.Ich bin Mr. Bean.Ein guter Satz."
            ^                ^
{code}
The result was 3 sentences, which is correct, but the splits were not at the
eosChar. Each split came after the token containing the eosChar:
"I am hungry.Ich" , "bin Mr. Bean.Ein", ...
After some debugging I found that I have to set useTokenEnd=false.
I then found a *small bug in SentenceDetectorME* at the point where the span is
calculated:
{code:java}
public Span[] sentPosDetect(String s) {
  ...
  if (bestOutcome.equals(SPLIT) && isAcceptableBreak(s, index, cint)) {
    if (index != cint) {
      if (useTokenEnd) {
        positions.add(getFirstNonWS(s, getFirstWS(s, cint + 1)));
      }
      else {
        // current code: positions.add(getFirstNonWS(s, cint));
        // proposed fix: start the search one character after the eosChar
        positions.add(getFirstNonWS(s, cint + 1));
      }
      sentProbs.add(probs[model.getIndex(bestOutcome)]);
    }
    index = cint + 1;
  }
  ...
{code}
This change only affects models with useTokenEnd=false.
> SentenceDetectorME::sentPosDetect() with useTokenEnd=false
> ----------------------------------------------------------
>
> Key: OPENNLP-711
> URL: https://issues.apache.org/jira/browse/OPENNLP-711
> Project: OpenNLP
> Issue Type: Bug
> Components: Sentence Detector
> Affects Versions: 1.6.0
> Reporter: Eugen Hanussek
> Priority: Minor