Re: OpenNLP Sentence Detector: EOS Characters

Jens Grivolla Thu, 09 Feb 2012 06:57:09 -0800

On 02/09/2012 12:31 PM, Joern Kottmann wrote:

On Thu, Feb 9, 2012 at 11:41 AM, Jens Grivolla <j+...@grivolla.net> wrote:

[... using line breaks as sentence boundaries ...]
When introducing configurability of EOS characters it would be good to
take that into account and provide a way to deal with line breaks in the
documents.


Actually I think you need to detect the basic document/article structure
first, e.g. headline, sub-headline, paragraphs, bylines, ...
The Sentence Detector is designed to split a paragraph into sentences and
not to detect the document structure.

We mostly work with user-generated content (UGC) such as blog posts, forums, Twitter, etc. It is therefore not always clear that there is a well-defined document structure. On the other hand, punctuation is very irregular, and we find that in many cases sentences do not have proper EOS punctuation. In many cases a newline does imply a sentence break, but not always.

We always considered that the sentence detector splits a document into sentences, not pre-split individual paragraphs. The julielab wrappers for OpenNLP 1.3 that we are/were using always work on the full document text. I see that the new OpenNLP 1.5 UIMA integration has a configurable "ContainerType", so that might be an interesting option for us.

In your case I would try to make an Analysis Engine which can identify your
"text blocks", annotate them with an annotation and then
tell the Sentence Detector AE to only perform sentence splitting on the
text within these annotations (already implemented).
I used this to do news analysis.

Ok, while we don't necessarily have a clearly defined structure, we could definitely use simple rules to create generic segments and work on those. That might actually be a good solution.
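
A minimal sketch of such a rule, assuming that runs of consecutive non-blank lines form one generic segment. The function name segment_blocks and the blank-line rule are illustrative assumptions, not part of OpenNLP or its UIMA integration:

```python
import re
from typing import List, Tuple


def segment_blocks(text: str) -> List[Tuple[int, int]]:
    """Return (begin, end) character offsets of generic text segments.

    Assumed rule (hypothetical): a segment is a run of consecutive lines
    that each contain at least one non-whitespace character, so blank
    lines act as segment separators.
    """
    spans = []
    # Each match covers one run of non-blank lines, including line ends.
    for match in re.finditer(r"(?:[^\n]*\S[^\n]*(?:\n|$))+", text):
        raw = match.group()
        # Trim surrounding whitespace so offsets cover only the content.
        begin = match.start() + (len(raw) - len(raw.lstrip()))
        end = match.start() + len(raw.rstrip())
        spans.append((begin, end))
    return spans


if __name__ == "__main__":
    sample = "Title\n\nfirst line\nsecond line\n\n\nlast"
    for begin, end in segment_blocks(sample):
        print(begin, end, repr(sample[begin:end]))
```

The resulting offsets could then be turned into whatever annotation type is configured as the container, so that sentence splitting only runs inside each segment.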

We had a couple of bugs with the white space handling in the sentence
detector; these are now fixed. So you should not have any issues with
white space handling anymore.

The training of the sentence detector can be done with the UIMA integration
as well; there you need to provide CASes with sentence annotations.

Yes, we have started using those for POStag training, etc., but not yet for sentence splitting.

Hope this helps,


Yes, it actually does help a lot. :-)

Thanks,
Jens
