On 02/09/2012 12:31 PM, Joern Kottmann wrote:
> On Thu, Feb 9, 2012 at 11:41 AM, Jens Grivolla <j+...@grivolla.net> wrote:
>> [... using line breaks as sentence boundaries ...]
>> When introducing configurability of EOS characters it would be good to
>> take that into account and provide a way to deal with line breaks in the
>> documents.
> Actually I think you need to detect the basic document/article structure
> first, e.g. headline, sub-headline, paragraphs, bylines, ...
> The Sentence Detector is designed to split a paragraph into sentences and
> not to detect the document structure.
We mostly work with user generated content (UGC) such as blog posts,
forums, Twitter, etc. It is therefore not always clear that there is a
well-defined document structure. At the same time, punctuation is very
irregular, and in many cases sentences lack proper EOS punctuation.
A newline often does imply a sentence break, but not always.
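To illustrate what I mean by configurable EOS characters plus line-break
handling, here is a rough sketch in plain Python. It is not OpenNLP code; the
function name split_sentences and its parameters eos_chars and
newline_is_break are made up for this example, and the rule (break after a
run of EOS characters followed by whitespace, and optionally at every line
break) is only a naive heuristic.

```python
import re

def split_sentences(text, eos_chars=".!?", newline_is_break=True):
    """Naive splitter (illustrative only, not the OpenNLP algorithm).

    Breaks after a run of EOS characters that is followed by whitespace
    and, if newline_is_break is set, at every line break as well.
    eos_chars is assumed to be a non-empty string.
    """
    eos = re.escape(eos_chars)
    # Whitespace directly preceded by an EOS character is a boundary.
    pattern = rf"(?<=[{eos}])\s+"
    if newline_is_break:
        # Any line break (with surrounding whitespace) is a boundary too.
        pattern += r"|\s*\n\s*"
    parts = re.split(pattern, text)
    # Normalize inner whitespace and drop empty pieces.
    return [" ".join(p.split()) for p in parts if p.strip()]

post = "great phone\nbattery lasts 2 days!! would buy again. thx"
print(split_sentences(post))
print(split_sentences(post, newline_is_break=False))
```

With newline_is_break=False the first line is merged into the following
sentence, which is exactly the kind of case where "always" or "never"
treating newlines as breaks both go wrong. The sketch also splits after
abbreviations such as "e.g.", which a trained model would handle better.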
We always considered that the sentence detector splits a document into
sentences, not pre-split individual paragraphs. The julielab wrappers
for OpenNLP 1.3 that we are/were using always work on the full document
text. I see that the new OpenNLP 1.5 UIMA integration has a configurable
"ContainerType", so that might be an interesting option for us.
> In your case I would try to write an Analysis Engine that identifies your
> "text blocks" and marks each with an annotation, and then tell the
> Sentence Detector AE to perform sentence splitting only on the text
> within these annotations (this is already implemented).
> I used this to do news analysis.
Ok, while we don't necessarily have a clearly defined structure we could
definitely use simple rules to create generic segments and work on
those. That might actually be a good solution.
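As a rough sketch of what such simple rules could look like (plain Python
rather than a UIMA Analysis Engine, with hypothetical names find_blocks and
sentences_in_blocks): blank lines delimit generic segments, and sentence
splitting then runs only inside each segment, returning character offsets
into the original text the way annotations would. The blank-line rule and
the punctuation rule are assumptions for illustration, not what OpenNLP does.

```python
import re

def find_blocks(text):
    """Return (start, end) offsets of segments separated by blank lines."""
    blocks, start, end, pos = [], None, None, 0
    for line in text.splitlines(keepends=True):
        if line.strip():
            if start is None:
                # Segment starts at the first non-whitespace character.
                start = pos + len(line) - len(line.lstrip())
            end = pos + len(line.rstrip())
        elif start is not None:
            # A blank line closes the current segment.
            blocks.append((start, end))
            start = None
        pos += len(line)
    if start is not None:
        blocks.append((start, end))
    return blocks

def sentences_in_blocks(text, blocks, eos_chars=".!?"):
    """Split each segment at runs of EOS characters; never cross a segment."""
    boundary = re.compile(rf"[{re.escape(eos_chars)}]+(?=\s|$)")
    spans = []
    for start, end in blocks:
        sent_start = start
        for m in boundary.finditer(text, start, end):
            spans.append((sent_start, m.end()))
            sent_start = m.end()
            # Skip whitespace before the next sentence.
            while sent_start < end and text[sent_start].isspace():
                sent_start += 1
        if sent_start < end:
            # Trailing text without EOS punctuation is still a sentence.
            spans.append((sent_start, end))
    return spans

doc = "My new phone\n\nbattery lasts 2 days!! would buy again\n\nthx"
blocks = find_blocks(doc)
for s, e in sentences_in_blocks(doc, blocks):
    print((s, e), repr(doc[s:e]))
```

The segment end acts as a sentence boundary even when there is no EOS
punctuation, while a single line break inside a segment does not force a
break.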
> We had a couple of bugs in the whitespace handling of the sentence
> detector; these are now fixed, so you should not have any more issues
> with whitespace handling.
> The sentence detector can also be trained through the UIMA integration;
> for that you need to provide CASes with sentence annotations.
Yes, we have started using those for POS tagger training, etc., but not
yet for sentence splitting.
> Hope this helps,
Yes, it actually does help a lot. :-)
Thanks,
Jens