Tim, is the training data something you can share publicly? Or privately? I can't publicly share the data that was used to train the sentence detector; I can only share the models that get built. And you can't build a model from an existing model + more data; you need all of the training data together.
Regarding how quickly we can get this out there, I can train a new sentence detector in a day or two. But that's just the first step - to really incorporate this, I would suggest this be a point release. We would need a release manager for that, and right now I don't have time for it. I also haven't heard a consensus on whether this should be the new behavior.

From what I remember, we are going to need code changes to make the code that splits at line breaks optional - or was your test replacing the existing cTAKES sentence detector and just using OpenNLP directly?

-- James

-----Original Message-----
From: Tim Miller [mailto:[email protected]]
Sent: Monday, January 27, 2014 8:52 AM
To: [email protected]
Subject: Re: sentence detector newline behavior

OK, with the most recent version I am able to replicate the performance I was getting before. Thanks a lot Jörn!

Assuming this is in the next incremental release of OpenNLP, how quickly can we get a re-trained model into cTAKES? I heard from a researcher at AMIA who tried cTAKES and, because of this bug in the way we handle sentences, was trying to find an outside sentence detector to run as a preprocessing step before cTAKES - and frankly, that is insane. We should be able to get something this simple right. I think this is the kind of thing that can leave new users scratching their heads and doubting our overall competence.

James, I believe you are usually the one who rebuilds the models? What would be the best way to incorporate the data I have that contains some instances of non-sentence-terminating newlines?

Tim

On 01/27/2014 06:10 AM, Jörn Kottmann wrote:
> On 01/26/2014 11:29 PM, Miller, Timothy wrote:
>> Yes, this fixes the whitespace sentence issue, but the evaluation issue
>> remains. I believe the problem is in SentenceSampleStream, where in the
>> following block the whitespace trim happens before the <LF> character is
>> replaced with the \n character. So test sentences that ended with <LF>
>> will be one character longer than they should be.
>>
>>     sentence = sentence.trim();
>>     sentence = replaceNewLineEscapeTags(sentence);
>>     sentencesString.append(sentence);
>>     int end = sentencesString.length();
>>     sentenceSpans.add(new Span(begin, end));
>>     sentencesString.append(' ');
>
> Yes, that must be the issue. During training the new line is included in
> the span, and during detection the whitespace remover creates a span
> without the new line char.
>
> I suggest that the evaluator just ignore whitespace differences between
> sentences. My test case then has the expected performance numbers.
>
> What do you think?
>
> Anyway, I committed the change. Please give it a try.
>
> Jörn
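
A whitespace-insensitive comparison of the kind Jörn suggests could, for example, normalize whitespace on both sides before comparing. The helper below is only an illustrative sketch (the method name is made up), not the actual OpenNLP evaluator code:

    // Illustrative only: collapse runs of whitespace (spaces, tabs, newlines)
    // into a single space and strip leading/trailing whitespace, then compare.
    static boolean equalsIgnoringWhitespace(String detected, String reference) {
        String a = detected.replaceAll("\\s+", " ").trim();
        String b = reference.replaceAll("\\s+", " ").trim();
        return a.equals(b);
    }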

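Retraining the sentence detector itself (the "day or two" step James mentions) would, with the OpenNLP 1.5-style API, look roughly like the sketch below. The file names are placeholders and the exact train(...) signature may differ between OpenNLP versions, so treat this as an outline rather than the actual cTAKES model-building code:

    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.OutputStream;
    import java.nio.charset.Charset;

    import opennlp.tools.sentdetect.SentenceDetectorME;
    import opennlp.tools.sentdetect.SentenceModel;
    import opennlp.tools.sentdetect.SentenceSample;
    import opennlp.tools.sentdetect.SentenceSampleStream;
    import opennlp.tools.util.ObjectStream;
    import opennlp.tools.util.PlainTextByLineStream;
    import opennlp.tools.util.TrainingParameters;

    public class TrainSentenceDetector {

        public static void main(String[] args) throws Exception {
            // One training sentence per line; "sentences.train" and
            // "en-sent-new.bin" are placeholder file names.
            ObjectStream<String> lines = new PlainTextByLineStream(
                    new FileInputStream("sentences.train"), Charset.forName("UTF-8"));
            ObjectStream<SentenceSample> samples = new SentenceSampleStream(lines);

            SentenceModel model;
            try {
                // useTokenEnd = true, no abbreviation dictionary, default parameters.
                model = SentenceDetectorME.train(
                        "en", samples, true, null, TrainingParameters.defaultParams());
            } finally {
                samples.close();
            }

            OutputStream out = new FileOutputStream("en-sent-new.bin");
            try {
                model.serialize(out);
            } finally {
                out.close();
            }
        }
    }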