2012/1/26 Riccardo Tasso <[email protected]>:
> Hi all,
>    I'm looking into using Wikipedia as a source to train my own NameFinder.
>
> The main idea is based on two assumptions:
> 1) Almost every Wikipedia article has a template which makes it easy to
> classify it as a Person, Place or some other kind of entity

You should use the DBpedia NTriples dumps, as done in
https://github.com/ogrisel/pignlproc , instead of parsing the Wikipedia
templates yourself. The type information for persons, places and
organizations is very good.
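Extracting the typed resources from the NTriples dump is basically a
one-pass filter. A rough sketch in Python (the dump file name and the
ontology URIs are assumptions based on the DBpedia downloads, adjust
them to the version you use):

    # Collect the Wikipedia resources typed as Person / Place / Organisation
    # in the DBpedia instance types NTriples dump (assumed file name).
    TYPES = {
        "<http://dbpedia.org/ontology/Person>": "person",
        "<http://dbpedia.org/ontology/Place>": "location",
        "<http://dbpedia.org/ontology/Organisation>": "organization",
    }

    typed_resources = {}
    with open("instance_types_en.nt") as f:
        for line in f:
            if line.startswith("#"):
                continue
            subj, pred, obj = line.split(" ", 2)
            obj = obj.rstrip(" .\n")
            if obj in TYPES:
                # subj is e.g. <http://dbpedia.org/resource/Paris>
                typed_resources[subj] = TYPES[obj]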

> 2) Each Wikipedia Article contains hyper text to other Wikipedia Articles
>
> Given that, it is possible to translate the links into typed annotations to
> train the Name Finder.
>
> I know that Olivier has already tried this approach, but I wanted to work on
> my own implementation and I think this is the right place to discuss it.
> There are some general questions and some more specific ones regarding the
> Name Finder.
>
> The general question regards the fact that Wikipedia isn't the "perfect"
> training set, because not all the entities are linked / tagged. The good
> thing is that the dataset is very large, which means a lot of tagged
> examples and a lot of untagged ones. Do you think this is a huge problem?

I don't think it's a huge problem for training, but it is indeed a
problem for performance evaluation: if you use some held-out folds from
this dataset for performance evaluation (precision, recall, f1-score of
the trained NameFinder model), then the fact that the dataset itself is
missing annotations will artificially inflate the false positive rate
estimate, which can have a large impact on the measured precision. For
instance, if the model predicts 100 entity mentions, 90 of which match
the gold annotations, and 5 of the remaining 10 are real entities that
simply lack a link in the article, the measured precision is 90% while
the true precision is 95%. The actual precision should therefore be
higher than what is measured.

I think the only way to fix this issue is to manually correct the
annotations of a small portion of the automatically generated dataset,
adding the missing annotations. We probably need around 1000 sentences
per type to get a non-ridiculous validation set.

Besides performance evaluation, the missing annotations will also bias
the model towards negative responses, hence increasing the false
negative rate and decreasing the model's true recall.
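To make that concrete, a sentence generated from an article where only
the first mention is wiki-linked would come out roughly like this in
the OpenNLP name finder training format (made-up example):

    <START:location> Paris <END> is the capital of <START:location> France <END> .
    The Seine flows through Paris .

The second mention of "Paris" has no link in the source article, so it
stays untagged and the trainer sees it as a negative example.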

In my first experiment, reported in [1], I had not taken the Wikipedia
redirect links into account, which probably aggravated this problem
even further. The current version of the pig script has been fixed
w.r.t. redirect handling [2], but I have not found the time to rerun a
complete performance evaluation. Handling redirects will solve frequent
classification errors such as "China", which redirects to "People's
Republic of China" in Wikipedia, so it alone may improve the quality of
the data, and hence of the trained model, by quite a bit.

[1] 
http://dev.blogs.nuxeo.com/2011/01/mining-wikipedia-with-hadoop-and-pig-for-natural-language-processing.html
[2] 
https://github.com/ogrisel/pignlproc/blob/master/examples/ner-corpus/02_dbpedia_article_types.pig#L22
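Resolving the redirects before looking up the DBpedia types is
essentially a dictionary lookup. A minimal sketch, assuming the
redirects NTriples dump (file name to be double checked) and the type
mapping built above:

    # Build a redirect map and use it to normalize link targets
    # before the type lookup.
    redirects = {}
    with open("redirects_en.nt") as f:
        for line in f:
            if line.startswith("#"):
                continue
            subj, pred, obj = line.split(" ", 2)
            redirects[subj] = obj.rstrip(" .\n")

    def resolve(resource, max_hops=5):
        # Follow redirect chains, e.g. the China resource ->
        # the People's_Republic_of_China resource.
        for _ in range(max_hops):
            if resource not in redirects:
                break
            resource = redirects[resource]
        return resource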

Also note that the perceptron model was not available when I ran this
experiment. It is probably more scalable, especially memory-wise, and
would be well worth trying.

> What do you think about selecting as training set a subset of pages with
> high precision? I have some ideas about which strategy to implement:
> * select only featured pages (which somehow is a guarantee that linking is
> done properly)

In my experience the DBpedia type links for Person, Place and
Organization are of very good quality: no false positives, though there
might be some missing links. It might be interesting to manually check
the top 100 recurring false positive names after a first round of
DBpedia extraction => model training => model evaluation on held-out
data. Then, if a significant portion of those false positive names
actually have missing type info in DBpedia or in the redirect links,
add them manually and iterate.
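Ranking the recurring false positives is easy to script once the
evaluation output is dumped somewhere. A rough sketch, assuming a list
of (tokens, predicted spans, gold spans) tuples with spans given as
(start, end, type) - the names and the structure here are hypothetical:

    from collections import Counter

    # Count surface forms that are predicted but absent from the gold
    # annotations, so the top ones can be checked against DBpedia by hand.
    false_positive_names = Counter()
    for tokens, predicted, gold in evaluation_results:
        gold_set = set(gold)
        for start, end, ent_type in predicted:
            if (start, end, ent_type) not in gold_set:
                false_positive_names[" ".join(tokens[start:end])] += 1

    for name, count in false_positive_names.most_common(100):
        print(count, name)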

> * selecting only pages regarding the Name Finder entity I'm trying to train
> (e.g. only People pages for People Name Finder)

I am not sure that is such a good idea. I think having 50% positive
examples and 50% negative examples would be better. However, because I
have no clue how bad the missing annotation issue is quantitatively, I
will abstain from commenting further on this :)

Anyway if you are interested in reviving the annotation sub-project,
please feel free to do so:

  https://cwiki.apache.org/OPENNLP/opennlp-annotations.html

We need a database of annotated open data text (Wikipedia, Wikinews,
Project Gutenberg...) with human validation metadata and a nice Web UI
to maintain it.

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
