Hi Jörn,

thanks for this explanation.
So what you are saying is that the context generator and the EOS scanner are not stored in the model, right?

I had assumed they were stored in the model. Other ML toolkits, e.g. Mallet (which uses "Pipe" logic where OpenNLP uses event streams), actually do this.

Maybe this would also be a good improvement...

Best
Katrin

On 02/09/2012 09:56 AM, Joern Kottmann wrote:
If you only do it during training, the sentence detector will not consider ":" as
a possible split during detection. That explains your drop in accuracy.
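
To make the mismatch concrete, here is a minimal, self-contained Java sketch.
It does not use the OpenNLP API. The class name, the method name, and the
contents of the "default" character set are assumptions made for illustration;
the thread only names the defaultEosCharacters array. It shows that a scanner
which does not know about ":" never reports it as a candidate position:

```java
import java.util.ArrayList;
import java.util.List;

class EosCandidateSketch {

    // Assumed default set; the thread does not list the actual characters.
    static final char[] ASSUMED_DEFAULT_EOS = {'.', '!', '?'};

    // The same set, extended with ':' as done during training in this thread.
    static final char[] EXTENDED_EOS = {'.', '!', '?', ':'};

    // Returns the index of every character in text that is in eosChars.
    static List<Integer> positions(String text, char[] eosChars) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            for (char eos : eosChars) {
                if (text.charAt(i) == eos) {
                    result.add(i);
                    break;
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "Note: call back. Done";
        // Without ':' in the set, index 4 is never offered as a candidate.
        System.out.println(positions(text, ASSUMED_DEFAULT_EOS)); // [15]
        System.out.println(positions(text, EXTENDED_EOS));        // [4, 15]
    }
}
```

If detection runs with the first set while training used the second, the ":"
positions are never even looked at, whatever the model has learned about them.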

It looks like it is not possible to modify the EOS characters properly with
the current version. I suggest that you check out the source code and then
change the defaultEosCharacters array in opennlp.tools.sentdetect.Factory.
That way you can run your test and get it working for now.

Anyway, we should have an easy way to specify the EOS characters without
implementing a custom Factory class.

Please open a Jira issue to improve this.

Jörn

On Thu, Feb 9, 2012 at 9:21 AM, Katrin Tomanek
<katrin.toma...@averbis.com> wrote:

Hi Jörn,

I only modified the training process.

However, when I check the predictions it turns out that the model never
learns to split at ":" positions.

Shouldn't it be enough to modify the DefaultSDContextGenerator and the
DefaultEndOfSentenceScanner so that they know about ":" as an EOS character?
Or are there other places where ":" should be added?

Best
Katrin



On 02/09/2012 09:18 AM, Joern Kottmann wrote:

Did you modify the evaluation as well? If you only do it during training,
the evaluator will not be able to consider ":" as an EOS character.

To me it sounds like it fails to split on the ":" somewhere.

The sentence detector uses a maxent model to classify every EOS character
as either a SPLIT or NO_SPLIT.
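
As a rough illustration of that scan-then-classify loop, here is a small,
self-contained Java sketch. It is not OpenNLP code: the class and method names
are hypothetical, and the classify method is a hand-written stand-in rule
(split before whitespace followed by an uppercase letter), not a maxent model.

```java
import java.util.ArrayList;
import java.util.List;

class SplitDecisionSketch {

    enum Outcome { SPLIT, NO_SPLIT }

    // Stand-in for the maxent model: SPLIT if the candidate ends the text or
    // is followed by whitespace and then an uppercase letter (or nothing).
    static Outcome classify(String text, int pos) {
        int next = pos + 1;
        if (next >= text.length()) {
            return Outcome.SPLIT;
        }
        if (!Character.isWhitespace(text.charAt(next))) {
            return Outcome.NO_SPLIT;
        }
        while (next < text.length() && Character.isWhitespace(text.charAt(next))) {
            next++;
        }
        return (next == text.length() || Character.isUpperCase(text.charAt(next)))
                ? Outcome.SPLIT : Outcome.NO_SPLIT;
    }

    // Only characters listed in eosChars are ever passed to the classifier.
    static List<String> detect(String text, String eosChars) {
        List<String> sentences = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < text.length(); i++) {
            if (eosChars.indexOf(text.charAt(i)) < 0) {
                continue; // not an EOS candidate, no decision is made here
            }
            if (classify(text, i) == Outcome.SPLIT) {
                String sentence = text.substring(start, i + 1).trim();
                if (!sentence.isEmpty()) {
                    sentences.add(sentence);
                }
                start = i + 1;
            }
        }
        String rest = text.substring(start).trim();
        if (!rest.isEmpty()) {
            sentences.add(rest);
        }
        return sentences;
    }

    public static void main(String[] args) {
        String text = "Diagnosis: Fever since 3.5 days. Therapy: Rest.";
        System.out.println(detect(text, ".!?"));  // 2 sentences
        System.out.println(detect(text, ".!?:")); // 4 sentences
    }
}
```

The "." in "3.5" is a candidate in both runs but is labelled NO_SPLIT, while
the ":" positions only produce splits in the second run, where ":" is part of
the candidate set.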

Jörn

On Thu, Feb 9, 2012 at 8:59 AM, Katrin Tomanek
<katrin.toma...@averbis.com> wrote:

Hi Willian,

I am currently using opennlp-1.5.2 and try to use it as an API, i.e. not
to modify its code but to write my own code around it. However, what I
described below (with the SDEventStream) amounts to the same thing you are
describing: I am changing the set of EOS characters.

I am just wondering why adding ":" as an EOS character decreases the
results (the F-measure for sentence splitting drops from ~80 to 45, and ":"
is always a sentence boundary symbol in my data!).

Looks like I need to debug a little bit more what's happening in the
DefaultSDContextGenerator.




--
Dr. Katrin Tomanek
Averbis GmbH
Tennenbacher Strasse 11
D-79106 Freiburg

Fon: +49 (0) 761 - 203 97696
Fax: +49 (0) 761 - 203 97694
E-Mail: katrin.toma...@averbis.com

Managing Directors: Dr. med. Philipp Daumke, Dr. Kornél Markó
Registered office: Freiburg i. Br.
Commercial register: AG Freiburg i. Br., HRB 701080




