Hi Jörn,

Good, I'll have a look at the dev list tomorrow.

But still a question on the EOS symbols:

For some testing, I just overwrote the SentenceDetectorME.train(...) method, where I basically changed the way the EventStream is set up to:

EventStream eventStream = new SDEventStream(sampleStreamTrain,
        new DefaultSDContextGenerator(new char[]{'.', '!', '?', ':'}),
        new DefaultEndOfSentenceScanner(new char[]{'.', '!', '?', ':'}));


--> I thought that by doing so I would add ":" as a possible sentence boundary. However, this did not really help; the model actually gets worse. Maybe I am still misunderstanding how the EOS symbols work?
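
To illustrate what the EOS character set controls, here is a minimal, self-contained sketch of candidate scanning. It is not OpenNLP's DefaultEndOfSentenceScanner; the class and method names (EosScanSketch, candidatePositions) are hypothetical and only show the idea that a position can be classified as a boundary only if the scanner proposes it first.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is not OpenNLP code.
public class EosScanSketch {

    /** Returns every index in text whose character is one of eosChars. */
    static List<Integer> candidatePositions(CharSequence text, char[] eosChars) {
        List<Integer> positions = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            for (char eos : eosChars) {
                if (c == eos) {
                    positions.add(i);
                    break; // one match per position is enough
                }
            }
        }
        return positions;
    }

    public static void main(String[] args) {
        String text = "Diagnosis: none. Next visit?";
        // Default set: the ':' position is never proposed as a candidate.
        System.out.println(candidatePositions(text, new char[]{'.', '!', '?'}));
        // Extended set: the ':' position becomes a candidate as well.
        System.out.println(candidatePositions(text, new char[]{'.', '!', '?', ':'}));
    }
}
```

If this picture is right, the same character set would have to be used both when the training events are generated and when the trained model is applied. A detector that scans with the default set at prediction time would never propose positions after ":", whatever the model has learned.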

Best
Katrin



On 02/08/2012 06:04 PM, Joern Kottmann wrote:
On Wed, Feb 8, 2012 at 5:52 PM, Katrin Tomanek
<katrin.toma...@averbis.com> wrote:

Hi everybody,

I was just evaluating the opennlp sentence detector trained on some of our
data (using the Evaluator-class provided with opennlp). It did not perform
very well and when I checked out the misclassified sentences and debugged a
little bit, I realized that only these EOS (end of sentence) characters are
currently supported:

'.', '!', '?'

However, in our case we have many other EOS characters (":" being one of
the most common ones).

As I understand it, the EOS characters are defined in
DefaultSDContextGenerator.java, which is called from
SentenceDetectorME.train(...).

If I got it correctly, there is currently no way to configure (as a
parameter or so) the EOS characters. Right?

Of course, I could write my own train method and do things differently,
but then, I would not be able to use the Evaluator and CrossValidator
classes which I find very handy.

Did I misunderstand anything, and is there a way to configure which EOS
characters should be used? If not, do you think it would be a good thing
to have and if so, how can I contribute at this point?



You are absolutely right, we should have this option. William just started
a thread on the dev list to discuss this.

Our current idea to solve it is that you can pass in the name of a Factory
class which can put the SentenceDetector together the way you need it.

But now that I think about it, maybe we should define a Properties file
which can contain custom configuration for a component. In this file we
could have a property for a custom factory class and maybe a property
which contains the EOS chars for the Sentence Detector.
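
As a rough sketch of the properties idea, and nothing more than that: the key name "sentdetect.eosChars" and the class EosPropertiesSketch below are hypothetical, since no such configuration format has been decided on. The example only shows how EOS characters could be read from a standard java.util.Properties source, falling back to the defaults when the property is absent.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Hypothetical illustration; the property key is an assumption, not an OpenNLP setting.
public class EosPropertiesSketch {

    static final String EOS_KEY = "sentdetect.eosChars";

    /** Returns the configured EOS characters, or a copy of defaults if none are set. */
    static char[] readEosChars(Properties props, char[] defaults) {
        String value = props.getProperty(EOS_KEY);
        if (value == null || value.isEmpty()) {
            return defaults.clone();
        }
        return value.toCharArray();
    }

    public static void main(String[] args) throws IOException {
        Properties props = new Properties();
        // In practice this would be loaded from a file; a string keeps the sketch self-contained.
        props.load(new StringReader("sentdetect.eosChars=.!?:\n"));
        char[] eos = readEosChars(props, new char[]{'.', '!', '?'});
        System.out.println(new String(eos));
    }
}
```

A property for the custom factory class could be read the same way and instantiated by name, but that part depends on the factory interface that still has to be agreed on.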

Anyway, help is always very welcome. We should decide how to implement it
in the thread on the dev list, and then we can open a few JIRA issues to
actually do the work. This way you should be able to contribute easily.

Jörn



--
Dr. Katrin Tomanek
Averbis GmbH
Tennenbacher Strasse 11
D-79106 Freiburg

Fon: +49 (0) 761 - 203 97696
Fax: +49 (0) 761 - 203 97694
E-Mail: katrin.toma...@averbis.com

Geschäftsführer: Dr. med. Philipp Daumke, Dr. Kornél Markó
Sitz der Gesellschaft: Freiburg i. Br.
AG Freiburg i. Br., HRB 701080
