2011/12/4 Erel Segal <[email protected]>:
> Hello,
>
> I am trying to use NameFinder for a project of detecting organization names
> in Wikipedia. I use the *en-ner-organization.bin* model that I downloaded
> from the site, and I get strange results. Can you please help me understand
> these results, and what I should do to correct them?
>
> I ran NameFinder on two sentences from this page:
> http://en.wikipedia.org/wiki/Air_France
>
> The first sentence was: "*For its Première cabin , Air France 's first
> class menu is designed by Guy Martin , chef of Le Grand Vefour , a Michelin
> three-star restaurant in Paris .*" For this sentence, the name finder
> correctly tagged "*Air France*" as an organization.
>
> However, the second sentence was "*The following day , Air France was
> further instructed to share African routes with Air Afrique and UAT .*".
> For this, the name finder tagged only "*Air*" as an organization.
>
> This seems strange as the two contexts seem similar. How can you explain
> this?

In the first example, the possessive marker "'s" is likely a strong clue that the tokens right before it form some kind of named entity (an organization in this case), whereas this clue is missing in the second example. But since OpenNLP NER models are statistical models with many dimensions (features), it is hard to single out one reason for an individual failure.

The only way to fix this is to retrain the model on a larger annotated dataset than the one used to build the default model, or to come up with better features (the only way to tell whether they help is to evaluate them on an annotated corpus). Both tasks require significant investment, unfortunately.

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
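
To make "evaluating on an annotated corpus" concrete, here is a minimal sketch of exact-match, span-level precision, recall and F1 in plain Python. It does not call OpenNLP. The function name `span_prf`, the `(start, end, type)` span layout with an exclusive end index, and the gold annotations for the two Air France sentences are all assumptions made for illustration (the gold spans are hand-written here, and the single predicted "Air" is assumed to be the first one in the sentence).

```python
def span_prf(gold, predicted):
    """Exact-match span scores over a corpus.

    gold, predicted: one set of (start, end, type) token spans per sentence,
    in the same sentence order. 'end' is exclusive (an assumption of this sketch).
    Returns (precision, recall, f1).
    """
    tp = fp = fn = 0
    for gold_spans, pred_spans in zip(gold, predicted):
        tp += len(gold_spans & pred_spans)   # spans found with exact boundaries and type
        fp += len(pred_spans - gold_spans)   # spurious or mis-bounded predictions
        fn += len(gold_spans - pred_spans)   # gold spans the model missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical gold annotations for the two sentences from the question.
gold = [
    {(5, 7, "organization")},                       # "Air France" in sentence 1
    {(4, 6, "organization"),                        # "Air France" in sentence 2
     (14, 16, "organization"),                      # "Air Afrique"
     (17, 18, "organization")},                     # "UAT"
]
# Model output as described in the question.
predicted = [
    {(5, 7, "organization")},                       # correct
    {(4, 5, "organization")},                       # only "Air" was tagged
]

precision, recall, f1 = span_prf(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

With exact matching, the truncated "Air" counts both as a false positive and as a missed "Air France", which is the usual strict convention for NER scoring. Two sentences are of course far too few to compare models or feature sets; the same computation only becomes meaningful on a held-out annotated corpus.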
