Re: OpenNLP Sentence Detector: EOS Characters

Katrin Tomanek Thu, 09 Feb 2012 03:10:53 -0800

Hi Jens,

not sure how this could be done, because -- as you mentioned -- the training format is such that each sentence is on one line.


Do you have a concrete suggestion for how to do it?

In our company-internal UIMA wrappers we did something similar to what you probably did: consider each line break as a sentence boundary (in addition to what the OpenNLP sentence detector says). This, however, is not always a good idea, depending on your document input.
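
To make the idea concrete, here is a minimal sketch of that workaround. It is not our actual wrapper code. It cuts the text at every line break first and then runs a sentence detector on each line separately, shifting the resulting offsets back into the original document. The class LineBreakSentenceSplitter, the Span holder and naiveDetect are hypothetical names made up for this illustration; naiveDetect only stands in for the real detector so that the example runs without OpenNLP on the classpath. With OpenNLP one would presumably plug in an adapter around SentenceDetectorME.sentPosDetect(String), which returns character spans as well.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical illustration: treat every line break as a sentence boundary,
// in addition to the boundaries reported by a pluggable sentence detector.
class LineBreakSentenceSplitter {

    /** Half-open character range [start, end) into some text. */
    static final class Span {
        final int start;
        final int end;
        Span(int start, int end) { this.start = start; this.end = end; }
    }

    /**
     * Splits the text at '\n' and applies the detector to each line on its own.
     * The detector returns spans relative to the line; they are shifted by the
     * line's start offset so the result refers to the original text.
     */
    static List<Span> split(String text, Function<String, List<Span>> detector) {
        List<Span> result = new ArrayList<>();
        int lineStart = 0;
        while (lineStart <= text.length()) {
            int newline = text.indexOf('\n', lineStart);
            int lineEnd = (newline < 0) ? text.length() : newline;
            String line = text.substring(lineStart, lineEnd);
            for (Span s : detector.apply(line)) {
                result.add(new Span(lineStart + s.start, lineStart + s.end));
            }
            if (newline < 0) {
                break;
            }
            lineStart = newline + 1;
        }
        return result;
    }

    /**
     * Stand-in detector (an assumption, NOT OpenNLP): ends a sentence after
     * every '.', '!' or '?' and trims surrounding whitespace.
     */
    static List<Span> naiveDetect(String s) {
        List<Span> out = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '.' || c == '!' || c == '?') {
                addTrimmed(out, s, start, i + 1);
                start = i + 1;
            }
        }
        addTrimmed(out, s, start, s.length());
        return out;
    }

    // Adds [start, end) without leading/trailing whitespace; skips empty spans.
    private static void addTrimmed(List<Span> out, String s, int start, int end) {
        while (start < end && Character.isWhitespace(s.charAt(start))) start++;
        while (end > start && Character.isWhitespace(s.charAt(end - 1))) end--;
        if (start < end) out.add(new Span(start, end));
    }

    public static void main(String[] args) {
        String text = "Dear all\nFirst point. Second point\nThanks!";
        for (Span s : split(text, LineBreakSentenceSplitter::naiveDetect)) {
            System.out.println("[" + text.substring(s.start, s.end) + "]");
        }
    }
}
```

Here the header-like line "Dear all" becomes its own sentence even though it has no EOS character. That is exactly the case where the approach helps, and also where it hurts: in hard-wrapped text, every wrapped line would be cut into a separate "sentence".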


Best
Katrin

On 02/09/2012 11:41 AM, Jens Grivolla wrote:

Hi,

On 02/08/2012 05:52 PM, Katrin Tomanek wrote:

[...] I realized that only these EOS (end of sentence)
characters are currently supported:

'.', '!', '?'

However, in our case we have many other EOS (":" as one of the most
common ones)


I believe our situation is even worse, because we want to have line
breaks as possible EOS. We use OpenNLP through UIMA where this should
not be an issue, but I understand that the algorithms are designed to
work with training files that use line breaks to represent sentence
boundaries, i.e. line breaks are used as a meta character that can not
actually occur within the document.

When introducing configurability of EOS characters it would be good to
take that into account and provide a way to deal with line breaks in the
documents.

Jens



--
Dr. Katrin Tomanek
Averbis GmbH
Tennenbacher Strasse 11
D-79106 Freiburg

Fon: +49 (0) 761 - 203 97696
Fax: +49 (0) 761 - 203 97694
E-Mail: katrin.toma...@averbis.com

Geschäftsführer: Dr. med. Philipp Daumke, Dr. Kornél Markó
Sitz der Gesellschaft: Freiburg i. Br.
AG Freiburg i. Br., HRB 701080
