One issue I observed is that the previous map could work much better.
In German person names the word "von" (from) can appear, but it often occurs
again in the article without being part of a name. That reduces the weight
of the previous map features.
I also experimented with combining the previous map features with the
context, e.g. the word before or the word after.
I think having a list of tokens which should not be tracked by the
previous map will usually help to improve its performance.
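
A minimal sketch of that idea, not the OpenNLP implementation: the function
names, the "pd" feature prefix, and the contents of the ignore list are all
assumptions made for illustration. Tokens on the ignore list are neither
stored in the previous map nor given previous map features, and the second
feature combines the previous outcome with the word before.

```python
from typing import Dict, List, Set

# Hypothetical ignore list; the tokens are only examples.
IGNORED_TOKENS: Set[str] = {"von", "van", "der"}


def update_previous_map(tokens: List[str], outcomes: List[str],
                        previous_map: Dict[str, str],
                        ignored: Set[str] = IGNORED_TOKENS) -> None:
    """Remember the outcome of each token, except for ignored tokens."""
    for token, outcome in zip(tokens, outcomes):
        if token.lower() not in ignored:
            previous_map[token] = outcome


def previous_map_features(tokens: List[str], index: int,
                          previous_map: Dict[str, str],
                          ignored: Set[str] = IGNORED_TOKENS) -> List[str]:
    """Features for tokens[index] based on outcomes seen earlier in the document."""
    token = tokens[index]
    if token.lower() in ignored:
        return []  # ignored tokens contribute no previous map features
    previous = previous_map.get(token)  # None if the token was not seen before
    features = [f"pd={previous}"]
    if index > 0:
        # Previous outcome combined with the word before.
        features.append(f"pd,prev={previous},{tokens[index - 1]}")
    return features
```

With this, "von" in "Otto von Bismarck" is never stored, so a later
"Brief von Bismarck" gets no misleading feature for "von", while "Bismarck"
still does.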
Jörn
On 9/2/11 4:36 AM, Jason Baldridge wrote:
Before starting to make significant changes just based on the numbers, it
would be very useful to see the system output side by side with the gold
annotations to see if there are clear patterns in the errors.
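
One way to get such a view is sketched below. It assumes that the gold and
system outcomes are available as one label per token; the function name and
the label strings are hypothetical, not part of any OpenNLP tool.

```python
from typing import List


def side_by_side(tokens: List[str], gold: List[str], predicted: List[str]) -> List[str]:
    """Return one row per token: token, gold label, system label, mismatch marker."""
    rows = []
    for token, g, p in zip(tokens, gold, predicted):
        marker = "" if g == p else "<-- mismatch"
        rows.append(f"{token:<15}{g:<15}{p:<15}{marker}".rstrip())
    return rows


if __name__ == "__main__":
    tokens = ["Otto", "von", "Bismarck", "sprach"]
    gold = ["person-start", "person-cont", "person-cont", "other"]
    predicted = ["person-start", "other", "person-start", "other"]
    print("\n".join(side_by_side(tokens, gold, predicted)))
```

Rows that carry the mismatch marker can then be collected over the whole
test set and sorted by token to look for recurring error patterns.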
-Jason
On Thu, Sep 1, 2011 at 3:41 PM, Jörn Kottmann <[email protected]> wrote:
On 9/1/11 4:50 PM, [email protected] wrote:
Maybe you need some language-specific features. I just evaluated the
Portuguese proper name finder with the default OpenNLP features and got
the following:
Evaluated 56994 samples with 26462 entities; found: 26623 entities; correct: 23077.
TOTAL: precision: 86,68%; recall: 87,21%; F1: 86,94%.
prop: precision: 86,68%; recall: 87,21%; F1: 86,94%. [target: 26462; tp: 23077; fp: 3546]
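
For reference, the percentages follow from the counts in the last line. This
is a small sketch using the standard definitions of precision, recall and
F1; the function name is made up for illustration.

```python
from typing import Tuple


def precision_recall_f1(target: int, tp: int, fp: int) -> Tuple[float, float, float]:
    """Compute precision, recall and F1 from the gold count, true and false positives."""
    found = tp + fp  # entities the system proposed
    precision = tp / found if found else 0.0
    recall = tp / target if target else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


p, r, f1 = precision_recall_f1(target=26462, tp=23077, fp=3546)
print(f"precision: {p:.2%}; recall: {r:.2%}; F1: {f1:.2%}")
# prints "precision: 86.68%; recall: 87.21%; F1: 86.94%"
```

Note that tp + fp = 26623, which matches the "found" count in the first line.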
A friend of mine is working directly with Maxent and got better results
because he is using specific features he developed for Portuguese. But it
is really difficult to tune it.
I am still not sure how the feature generation should be modified. These
papers suggest that using prefix and suffix features helps, and we already
have such feature generators. When I use these, the recall goes up a little,
and so does the precision.
I now get 85% precision and 44% recall, but I would still like to get a
much higher recall, somewhere in the range of 70% or even 80%.
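
To illustrate what prefix and suffix features look like, here is a rough
sketch. It is not the code of the existing feature generators; the function
name, the "pre="/"suf=" feature names and the maximum length of 4 are
assumptions.

```python
from typing import List


def prefix_suffix_features(token: str, max_length: int = 4) -> List[str]:
    """Emit the leading and trailing character n-grams of a token as features."""
    limit = min(max_length, len(token))
    features = [f"pre={token[:n]}" for n in range(1, limit + 1)]
    features += [f"suf={token[-n:]}" for n in range(1, limit + 1)]
    return features


print(prefix_suffix_features("Kottmann", max_length=3))
# prints ['pre=K', 'pre=Ko', 'pre=Kot', 'suf=n', 'suf=nn', 'suf=ann']
```

Suffixes such as "mann" or "berg" are fairly common in German surnames and
place names, which is one reason such features might help with recall.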
Some also use trigger words or other dictionaries, not sure if that helps much.
Maybe compound noun splitting helps, not sure.
Or should I try to use a topic model, like they do in more modern NERs?
Jörn