On Sun, Dec 4, 2011 at 12:58 PM, Olivier Grisel <[email protected]> wrote:
> 2011/12/4 Erel Segal <[email protected]>:
> > Hello,
> >
> > I am trying to use NameFinder for a project of detecting organization
> > names in Wikipedia. I use the *en-ner-organization.bin* model that I
> > downloaded from the site, and I get strange results. Can you please help
> > me understand these results, and what I should do to correct them?
> >
> > I ran NameFinder on two sentences from this page:
> > http://en.wikipedia.org/wiki/Air_France
> >
> > The first sentence was: "*For its Première cabin , Air France 's first
> > class menu is designed by Guy Martin , chef of Le Grand Vefour , a
> > Michelin three-star restaurant in Paris .*" For this sentence, the name
> > finder correctly tagged "*Air France*" as an organization.
> >
> > However, the second sentence was "*The following day , Air France was
> > further instructed to share African routes with Air Afrique and UAT .*".
> > For this, the name finder tagged only "*Air*" as an organization.
> >
> > This seems strange as the two contexts seem similar. How can you explain
> > this?
>
> In the first example the "'s" possessive marker must be a strong clue
> that the token right before it is some kind of named entity
> (hence an organization in that case), whereas this clue is missing in the
> second example. But as OpenNLP NER models are statistical models with
> many dimensions (features), it's hard to pick any single reason to
> explain individual failures.
>
> Anyway, the only way to fix this is to retrain the model on a larger
> annotated dataset than the one used initially to build the default
> model, or to come up with better features (the only way to tell is by
> evaluating them on an annotated corpus). Both tasks require significant
> investments, unfortunately.
>
> --
> Olivier
> http://twitter.com/ogrisel - http://github.com/ogrisel
>

OK, well, how can I find the original corpus that was used to create this
model? I would like to use it as a basis for training a new model.

Thanks,
Erel
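
For reference, the "evaluating them on an annotated corpus" step mentioned above usually comes down to comparing predicted entity spans against gold spans. Below is a minimal Python sketch of exact-match precision, recall and F1, not OpenNLP's own evaluator. The function name `span_prf`, the tuple layout, and the gold annotations for the two Air France sentences are all assumptions made for illustration; the "predicted" spans only mimic the behaviour described in the thread (the full "Air France" in the first sentence, only "Air" in the second).

```python
from typing import Iterable, Tuple

# (sentence index, start token, end token (exclusive), entity type)
# This layout is an assumption for the sketch, not an OpenNLP data structure.
Span = Tuple[int, int, int, str]


def span_prf(gold: Iterable[Span], predicted: Iterable[Span]) -> Tuple[float, float, float]:
    """Exact-match precision, recall and F1 over entity spans.

    A predicted span counts as correct only if its sentence, boundaries
    and type all match a gold span exactly.
    """
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical gold annotations (token offsets counted on the
# whitespace-tokenized sentences quoted in the thread).
gold = [
    (0, 5, 7, "organization"),    # sentence 1: "Air France"
    (1, 4, 6, "organization"),    # sentence 2: "Air France"
    (1, 14, 16, "organization"),  # sentence 2: "Air Afrique"
    (1, 17, 18, "organization"),  # sentence 2: "UAT"
]

# Hypothetical predictions imitating the reported behaviour.
predicted = [
    (0, 5, 7, "organization"),    # correct
    (1, 4, 5, "organization"),    # only "Air": wrong boundary, counts as an error
]

precision, recall, f1 = span_prf(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Exact matching is deliberately strict: a truncated span such as "Air" counts as both a false positive and a missed entity, which is why boundary errors like the one reported here lower both precision and recall.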
