On 02/08/2012 08:21 PM, Jim - FooBar(); wrote:
- Your data is not tokenized
What exactly do you mean? The training data should be tokenized? I'm not
sure I follow...
It's after training that I need to tokenize in order to do NER, isn't it?

In OpenNLP the tokenization at training time and at run time must be
identical; otherwise performance degrades. In your case the data is
whitespace-tokenized during training but tokenized with the English
maxent tokenizer at run time.

That does not work well. You need to fix this; otherwise, I am confident
you will never be happy with the results.
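To make the mismatch concrete, here is a minimal, self-contained Java sketch. It does not use the OpenNLP API; `maxentStyleTokenize` is only a rough stand-in for what a trained tokenizer such as OpenNLP's `TokenizerME` tends to produce. The point is that whitespace tokenization leaves punctuation glued to words, so the name-finder model sees tokens at run time that never occurred in training.

```java
import java.util.Arrays;

public class TokenizationMismatch {

    // Whitespace tokenization, as used in the training data:
    // punctuation stays attached to the neighboring word.
    static String[] whitespaceTokenize(String sentence) {
        return sentence.trim().split("\\s+");
    }

    // Rough stand-in for a trained (maxent-style) tokenizer:
    // punctuation is split off into its own token. This is an
    // illustration only, not the real OpenNLP TokenizerME.
    static String[] maxentStyleTokenize(String sentence) {
        return sentence.trim()
                .replaceAll("([.,;:!?])", " $1")
                .split("\\s+");
    }

    public static void main(String[] args) {
        String sentence = "Aspirin inhibits COX-1, reducing inflammation.";
        System.out.println(Arrays.toString(whitespaceTokenize(sentence)));
        // [Aspirin, inhibits, COX-1,, reducing, inflammation.]
        System.out.println(Arrays.toString(maxentStyleTokenize(sentence)));
        // [Aspirin, inhibits, COX-1, ,, reducing, inflammation, .]
    }
}
```

Note that "COX-1," and "inflammation." appear as single tokens under whitespace tokenization, but the run-time tokenizer would emit "COX-1" and "inflammation" with the punctuation split off, so the model's features never match.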

- Adaptive data is not cleared
Is that necessary, since I'm treating the merged document as a single
document? I mean, I can do it, but will that make any noticeable
difference? I can easily append a newline when cat-ing the papers...

This again costs you a lot of performance, because the previous-map
feature is strong when you do it like this. I suggest you either
properly clear the adaptive data or drop the previous-map feature
generator.

Ignoring it does not really work.
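For the API route, the pattern is: run the name finder sentence by sentence within a document, then clear the adaptive data at each document boundary. The sketch below uses a tiny stand-in class so it runs standalone; it is not the real opennlp.tools.namefind.NameFinderME, but its find / clearAdaptiveData pair follows the same contract.

```java
import java.util.ArrayList;
import java.util.List;

public class DocumentBoundaries {

    // Stand-in for NameFinderME: it accumulates document context
    // (the "previous map") until the adaptive data is cleared.
    static class ToyNameFinder {
        private final List<String> adaptive = new ArrayList<>();

        String[] find(String[] sentence) {
            for (String t : sentence) adaptive.add(t);
            return new String[sentence.length]; // outcomes omitted in this sketch
        }

        void clearAdaptiveData() {
            adaptive.clear();
        }

        int adaptiveTokenCount() {
            return adaptive.size();
        }
    }

    public static void main(String[] args) {
        ToyNameFinder finder = new ToyNameFinder();
        String[][][] documents = {
            {{"John", "Smith", "visited", "Paris", "."}},
            {{"Smith", "Barney", "is", "a", "company", "."}},
        };
        for (String[][] document : documents) {
            for (String[] sentence : document) {
                finder.find(sentence);
            }
            // Crucial step: forget document-level context before the
            // next document, otherwise the previous-map feature leaks
            // across unrelated papers.
            finder.clearAdaptiveData();
        }
        System.out.println(finder.adaptiveTokenCount()); // prints 0
    }
}
```

Without the clearAdaptiveData() call, "Smith" in the second document would be scored with context carried over from "John Smith" in the first.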

Do you use our command line tools for training?
No, I'm writing my own application that uses the API...

I suggest using our command-line trainer and evaluator for testing.
When you get OK results, it is easy to reproduce them via the API.

You now have one sentence per line in the last sample you posted.
That is ok. Try to fix the other two problems and then try again.
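Putting the three points together, the training data should look roughly like the fragment below (the entity type "person" and the sentences are placeholders): tokens separated by single spaces exactly as your run-time tokenizer would produce them, one sentence per line, names marked with `<START:type> ... <END>`, and an empty line between papers so the trainer clears the adaptive data at each document boundary.

```
<START:person> John Smith <END> visited the lab on Friday .
He met <START:person> Mary Jones <END> there .

<START:person> Alan Turing <END> was born in London .
```

The blank line is what makes "cat-ing the papers with a newline in between" work: it is the document separator the trainer keys on.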

Hope that helps,
Jörn

