Tim, is the training data something you can share publicly? Or privately? I 
can't publicly share the data that has been used to train the sentence 
detector; I can only share the models that get built. And you can't build a 
model from an existing model plus more data. You need all the training data 
together.

Regarding how quickly we can get this out there, I can train a new sentence 
detector in a day or two. But that's just the first step - to really 
incorporate this, I would suggest this be a point release. We would need a 
release manager for that. Right now I don't have time for that. I haven't 
heard a consensus saying whether this should be the new behavior.

From what I remember, we are going to need code changes to make the code 
that splits at line breaks optional. Or was your test replacing the existing 
cTAKES sentence detector and just using OpenNLP directly?

-- James

-----Original Message-----
From: Tim Miller [mailto:[email protected]] 
Sent: Monday, January 27, 2014 8:52 AM
To: [email protected]
Subject: Re: sentence detector newline behavior

OK, with the most recent version I am able to replicate the performance 
I was getting before. Thanks a lot Jörn!

Assuming this is in the next incremental release of opennlp, how quickly 
can we get a re-trained model into cTAKES? I heard from a researcher at 
AMIA who tried cTAKES and because of this bug in the way we handle 
sentences was trying to find an outside sentence detector as a 
preprocess to cTAKES, and frankly that is insane. We should be able to 
get something this simple right. And I think this is the kind of thing 
that can leave new users scratching their heads and doubting our overall 
competence.

James, I believe you are usually the one who rebuilds the models? What 
would be the best way to incorporate the data I have that has some 
instances of non-sentence terminating newlines?

Tim


On 01/27/2014 06:10 AM, Jörn Kottmann wrote:
> On 01/26/2014 11:29 PM, Miller, Timothy wrote:
>> Yes, this fixes the whitespace sentence issue but the evaluation issue
>> remains. I believe the problem is in SentenceSampleStream, where in the
>> following block the whitespace trim happens before the <LF> character is
>> replaced with the \n character. So test sentences that ended with <LF>
>> will be one character longer than they should be.
>>
>>> >       sentence = sentence.trim();
>>> >       sentence = replaceNewLineEscapeTags(sentence);
>>> >       sentencesString.append(sentence);
>>> >       int end = sentencesString.length();
>>> >       sentenceSpans.add(new Span(begin, end));
>>> >       sentencesString.append(' ');
>
> Yes, that must be the issue. During training the new line is included
> in the span, and during detection the white space remover creates a
> span without the new line char.
>
> I suggest that the evaluator just ignores white space differences 
> between sentences. My test case then
> has the expected performance numbers.
>
> What do you think?
>
> Anyway, I committed the change. Please give it a try.
>
> Jörn
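
To make the off-by-one concrete, here is a minimal standalone sketch of the 
ordering problem in the block quoted above. It does not use OpenNLP: the 
class name and the two span-length helpers are made up for illustration, and 
replaceNewLineEscapeTags here is a simplified stand-in that only handles the 
literal "<LF>" tag, which is an assumption about what the real method does.

```java
public class TrimOrderDemo {

    // Hypothetical stand-in for the escape-tag replacement quoted above;
    // it only maps the literal tag "<LF>" to a newline character.
    static String replaceNewLineEscapeTags(String s) {
        return s.replace("<LF>", "\n");
    }

    // Order quoted in the thread: trim first, then replace the tag.
    // A trailing "<LF>" is not whitespace yet, so it survives the trim
    // and ends up as a trailing '\n' inside the sentence span.
    static int spanLengthTrimFirst(String raw) {
        String sentence = raw.trim();
        sentence = replaceNewLineEscapeTags(sentence);
        return sentence.length();
    }

    // Alternative order: replace the tag first, then trim.
    // The trailing '\n' is now whitespace, so trim() removes it.
    static int spanLengthReplaceFirst(String raw) {
        String sentence = replaceNewLineEscapeTags(raw);
        sentence = sentence.trim();
        return sentence.length();
    }

    public static void main(String[] args) {
        String raw = "Patient is stable.<LF>";
        System.out.println(spanLengthTrimFirst(raw));    // 19
        System.out.println(spanLengthReplaceFirst(raw)); // 18
    }
}
```

The one-character difference between the two results is the mismatch 
described above: a reference span that includes the newline compared against 
a detected span that does not. Whether to swap the order or to have the 
evaluator ignore whitespace differences, as Jörn suggests, is a separate 
design choice that this sketch does not settle.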
