Hi Jens,
I'm not sure how this could be done, because -- as you mentioned -- the
training format puts each sentence on its own line.
Do you have a concrete suggestion for how to do it?
In our company-internal UIMA wrappers we did something similar to what you
probably did: treat each line break as a sentence boundary (in addition
to the boundaries the OpenNLP sentence detector finds). This, however, is
not always a good idea, depending on your document input.
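
Roughly, the idea can be sketched as follows. This is only a minimal
illustration in Python, not the actual wrapper code: `naive_detector` is a
hypothetical stand-in for a real sentence detector (it just splits after
'.', '!' or '?'), and `split_sentences` is an assumed helper name.

```python
import re
from typing import Callable, List


def naive_detector(text: str) -> List[str]:
    """Hypothetical stand-in for a real sentence detector.

    Splits on whitespace that follows '.', '!' or '?'.
    """
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [p.strip() for p in parts if p.strip()]


def split_sentences(
    document: str,
    detect: Callable[[str], List[str]] = naive_detector,
) -> List[str]:
    """Treat every line break as a sentence boundary, then let the
    detector split further inside each line."""
    sentences: List[str] = []
    for line in document.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines carry no sentence
        sentences.extend(detect(line))
    return sentences


if __name__ == "__main__":
    doc = "Diagnosis:\nPatient is stable. No fever!\n\nFollow-up in two weeks"
    for sentence in split_sentences(doc):
        print(sentence)
```

The drawback shows up with hard-wrapped input: a sentence that continues
across a line break, such as "This sentence is wrapped\nacross two lines.",
is cut into two pieces, because the line break always wins.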
Best
Katrin
On 02/09/2012 11:41 AM, Jens Grivolla wrote:
> Hi,
>
> On 02/08/2012 05:52 PM, Katrin Tomanek wrote:
>> [...] I realized that only these EOS (end of sentence)
>> characters are currently supported:
>> '.', '!', '?'
>> However, in our case we have many other EOS (":" as one of the most
>> common ones).
>
> I believe our situation is even worse, because we want to have line
> breaks as possible EOS. We use OpenNLP through UIMA where this should
> not be an issue, but I understand that the algorithms are designed to
> work with training files that use line breaks to represent sentence
> boundaries, i.e. line breaks are used as a meta character that cannot
> actually occur within the document.
>
> When introducing configurability of EOS characters it would be good to
> take that into account and provide a way to deal with line breaks in the
> documents.
>
> Jens
--
Dr. Katrin Tomanek
Averbis GmbH