Re: OpenNLP Sentence Detector: EOS Characters

Katrin Tomanek Thu, 09 Feb 2012 01:31:23 -0800

Hi,

but somehow we have to force the model to know which context generator and EOS scanner to use. Otherwise, the features extracted during training and during testing are inconsistent.

I believe this must absolutely be avoided; otherwise I cannot trust my model.


Maybe, for convenience, this could be done:

- if you are just interested in using different EOS chars (but still going with the default context generator for feature extraction), then EOS chars can be defined during training and will be stored in the model
- and for better customization, we might also allow users to specify a custom factory which is stored in the model and ensures consistency.


What do you think?

We should probably start with the EOS thing, but keep in mind the second step...


Best
Katrin




On 02/09/2012 10:19 AM, Joern Kottmann wrote:

Yes, we should store the class name of the Factory in the model,
because storing the class itself there is a security problem.
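
A minimal sketch of the class-name idea, assuming the model carries only a fully qualified class name and the detector instantiates it reflectively at load time. SentenceFactorySketch, DefaultFactorySketch, and FactoryLoaderSketch are hypothetical names for illustration, not OpenNLP classes:

```java
// Illustrative only: none of these types exist in OpenNLP.
final class FactoryLoaderSketch {

    // Instantiate the factory whose class name was read from the model.
    // The class must already be on the classpath; no code is loaded from the model itself.
    static SentenceFactorySketch load(String className) {
        try {
            Class<?> cls = Class.forName(className);
            return cls.asSubclass(SentenceFactorySketch.class)
                      .getDeclaredConstructor()
                      .newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot load factory " + className, e);
        }
    }

    public static void main(String[] args) {
        // The name would normally come from the model's metadata.
        String storedName = DefaultFactorySketch.class.getName();
        SentenceFactorySketch factory = load(storedName);
        System.out.println(new String(factory.eosCharacters()));
    }
}

// Hypothetical stand-in for whatever the factory would provide.
interface SentenceFactorySketch {
    char[] eosCharacters();
}

// Hypothetical default implementation with a no-argument constructor.
final class DefaultFactorySketch implements SentenceFactorySketch {
    @Override
    public char[] eosCharacters() {
        return new char[] {'.', '?', '!'};
    }
}
```

With this approach the same class has to be available at training and at detection time, which is what keeps the extracted features consistent.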

Anyway in my opinion you don't want to add an extra jar file to the
classpath just for a custom EOS character configuration.

So we should do both.

Jörn

On Thu, Feb 9, 2012 at 10:15 AM, Katrin Tomanek
<katrin.toma...@averbis.com> wrote:

Hi Jörn,

but I think one should even go a step further and store the factory in the
model.

At the moment, when instantiating a new Sentence Detector this happens:

  public SentenceDetectorME(SentenceModel model) {
    this(model, new Factory());
  }

This means that the factory is not stored in the model. Thus, if you use a
specific factory (because, e.g., you want a special way to generate the
features/context), you currently have no way to store this in the model.

This could become a problem if you trained a model with one kind of
context generator and apply this model to events which come from another
context generator. Since the features are different, applying the model
would not make much sense...

Best
Katrin


On 02/09/2012 10:10 AM, Joern Kottmann wrote:

We already have a properties file inside the model. It wouldn't be a
difficult fix to add a property to it which stores the EOS characters
which have been used during training.
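
As a rough sketch of that idea, assuming the EOS characters are kept as one string under a single property: the key name "eosCharacters" and the class EosPropertySketch are made up for this example and are not taken from the OpenNLP model format. Only java.util.Properties is used:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Properties;

// Illustrative only: the key name and this class are assumptions, not OpenNLP's format.
final class EosPropertySketch {

    // Hypothetical property key.
    static final String EOS_KEY = "eosCharacters";

    // Training side: serialize the EOS characters into properties-file bytes.
    static byte[] write(char[] eosChars) {
        Properties props = new Properties();
        props.setProperty(EOS_KEY, new String(eosChars));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            props.store(out, null);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Detection side: read the characters back, falling back to defaults
    // when the property is absent (e.g. a model trained before the change).
    static char[] read(byte[] data, char[] defaults) {
        Properties props = new Properties();
        try {
            props.load(new ByteArrayInputStream(data));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String value = props.getProperty(EOS_KEY);
        return value == null ? defaults : value.toCharArray();
    }

    public static void main(String[] args) {
        byte[] stored = write(new char[] {'.', '?', '!', ':'});
        System.out.println(new String(read(stored, new char[] {'.', '?', '!'})));
    }
}
```

The fallback to defaults is one way to keep older models loadable.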

Jörn

On Thu, Feb 9, 2012 at 10:06 AM, Katrin Tomanek
<katrin.toma...@averbis.com> wrote:

Hi Jörn,

thanks for this explanation.
What you are saying means that the context generator and the EOS scanner
are not stored in the model, right?

I had assumed they were stored... other ML toolkits, such as Mallet (which
uses the "Pipe" logic where OpenNLP uses event streams), actually do this.

Maybe this would also be a good improvement...

Best
Katrin

On 02/09/2012 09:56 AM, Joern Kottmann wrote:

When you only do it during training then it will not consider ":" as
a possible split during detection. That explains your drop in accuracy.

It looks like it is not possible to modify the EOS characters properly
with the current version. I suggest that you check out the source code and
then change the defaultEosCharacters array in
opennlp.tools.sentdetect.Factory.
With that you are able to do your test and get it working for now.

Anyway we should have an easy way to specify the EOS characters without
implementing a custom Factory class.

Please open a jira to improve this.

Jörn

On Thu, Feb 9, 2012 at 9:21 AM, Katrin Tomanek
<katrin.toma...@averbis.com> wrote:

Hi Jörn,


I only modified the training process.

However, when I check the predictions it turns out that the model never
learns to split at ":" positions.

Shouldn't it be enough to modify the DefaultSDContextGenerator and the
DefaultEndOfSentenceScanner so that these know about ":" as an EOS?
Or are there other places where ":" should be added?

Best
Katrin



On 02/09/2012 09:18 AM, Joern Kottmann wrote:

Did you modify the evaluation as well? If you just do it during training,
the evaluator will not be able to consider ":" as an EOS character.

To me it sounds like it fails to split on the ":" in some place.

The sentence detector uses a maxent model to classify every EOS character
as either a SPLIT or NO_SPLIT.
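
A simplified sketch of that two-step process, assuming the detector first scans for candidate EOS characters and then asks a classifier about each candidate. EosSplitSketch and its methods are hypothetical, and the IntPredicate merely stands in for the maxent model. It shows why a character that the scanner does not know at detection time can never become a split, whatever the model has learned:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.function.IntPredicate;

// Illustrative only: these names are hypothetical and not OpenNLP's API.
final class EosSplitSketch {

    // Step 1: collect the index of every character that is a known EOS character.
    static List<Integer> candidatePositions(String text, Set<Character> eosChars) {
        List<Integer> positions = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            if (eosChars.contains(text.charAt(i))) {
                positions.add(i);
            }
        }
        return positions;
    }

    // Step 2: cut the text after each candidate that the classifier labels as a split.
    static List<String> split(String text, Set<Character> eosChars, IntPredicate isSplit) {
        List<String> sentences = new ArrayList<>();
        int start = 0;
        for (int pos : candidatePositions(text, eosChars)) {
            if (isSplit.test(pos)) {
                String sentence = text.substring(start, pos + 1).trim();
                if (!sentence.isEmpty()) {
                    sentences.add(sentence);
                }
                start = pos + 1;
            }
        }
        String rest = text.substring(start).trim();
        if (!rest.isEmpty()) {
            sentences.add(rest);
        }
        return sentences;
    }

    public static void main(String[] args) {
        String text = "Diagnosis: fever. Therapy: rest.";
        IntPredicate alwaysSplit = pos -> true; // stand-in for the trained model
        // Without ':' among the EOS characters, the colons are never even considered.
        System.out.println(split(text, Set.of('.', '?', '!'), alwaysSplit));
        // With ':' added at detection time, they become candidates as well.
        System.out.println(split(text, Set.of('.', '?', '!', ':'), alwaysSplit));
    }
}
```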

Jörn

On Thu, Feb 9, 2012 at 8:59 AM, Katrin Tomanek
<katrin.toma...@averbis.com> wrote:



  Hi Willian,

I am currently using opennlp-1.5.2 and try to use it as an API, i.e. not
to modify this code but to write my own code around it. However, what I
described below (with the SDEventStream) results in the same as you are
describing: I am changing the set of EOS characters.

I am just wondering why adding ":" as an EOS character decreases the
results (dropping from ~80F to 45F in sentence splitting, and ":" is
always a sentence boundary symbol in my data!)

Looks like I need to debug a little bit more what's happening in the
DefaultSDContextGenerator.

--

Dr. Katrin Tomanek
Averbis GmbH
Tennenbacher Strasse 11
D-79106 Freiburg

Fon: +49 (0) 761 - 203 97696
Fax: +49 (0) 761 - 203 97694
E-Mail: katrin.toma...@averbis.com

Geschäftsführer: Dr. med. Philipp Daumke, Dr. Kornél Markó
Sitz der Gesellschaft: Freiburg i. Br.
AG Freiburg i. Br., HRB 701080

