Hi all,
    I'm looking into using Wikipedia as a source to train my own NameFinder.

The main idea is based on two assumptions:
1) Almost every Wikipedia article has a template which makes it easy to classify the article as a Person, a Place, or some other kind of entity
2) Each Wikipedia article contains hyperlinks to other Wikipedia articles

Given that, it is possible to translate links into typed annotations to train the Name Finder.
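
For example, a sentence whose links point to a Person article and a Place article could be converted to the Name Finder training format like this (the sentence and entity types are just illustrative):

    <START:person> Barack Obama <END> was born in <START:location> Honolulu <END> .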

I know that Olivier has already tried this approach, but I wanted to work on my own implementation, and I think this is the right place to discuss it. I have some general questions and some more specific ones regarding the Name Finder.

The general question regards the fact that Wikipedia isn't the "perfect" training set, because not all entities are linked / tagged. The good thing is that the dataset is very large, which means a lot of tagged examples and a lot of untagged ones. Do you think this is a huge problem?

What do you think about selecting as training set a subset of pages with high precision? I have some ideas about which strategy to implement:
* select only featured pages (which is somehow a guarantee that linking is done properly)
* select only pages regarding the entity type I'm trying to train (e.g. only People pages for a People Name Finder)

The specific questions regard the right tuning of the training parameters, which I think is a frequently asked question. I hope this discussion may lead to the creation of new material to improve the documentation; I warn you, I won't be brief. I'm starting from some hints given by Jörn:

On 19/01/2012 14:16, Jörn Kottmann wrote:

When I am doing training I always take our defaults as a baseline and then modify the parameters to see how that changes the performance. When you are working with a training set which grows over time, I suggest starting again from the defaults once in a while and verifying that the modifications still give an improvement.

A few hints:
- Using more iterations on the maxent model helps especially when your data set is small,
   e.g. try 300 to 500 instead of 100.

My dataset is huge, but I will also test adding more iterations.
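
If I understand correctly, the iterations are set through a TrainingParameters object passed to NameFinderME.train. A minimal sketch (the values are just examples):

    TrainingParameters params = new TrainingParameters();
    // raise the iterations from the default 100, as suggested
    params.put(TrainingParameters.ITERATIONS_PARAM, Integer.toString(300));
    params.put(TrainingParameters.CUTOFF_PARAM, Integer.toString(5));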


- Depending on the domain and language, feature generation should be adapted; try to use our XML feature generation (for this use the trunk version, there was a severe bug in 1.5.2).


For feature generation, I admit I have no idea how to use it. I'm using the CachedFeatureGenerator exactly as instantiated in the documentation. Can you help me by explaining the single generators? (For the XML descriptor, I'll write my guess below, after going through the generators.)
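
For reference, this is the setup I copied from the documentation:

    AdaptiveFeatureGenerator featureGenerator = new CachedFeatureGenerator(
        new AdaptiveFeatureGenerator[]{
            new WindowFeatureGenerator(new TokenFeatureGenerator(), 2, 2),
            new WindowFeatureGenerator(new TokenClassFeatureGenerator(true), 2, 2),
            new OutcomePriorFeatureGenerator(),
            new PreviousMapFeatureGenerator(),
            new BigramNameFeatureGenerator(),
            new SentenceFeatureGenerator(true, false)
        });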

new WindowFeatureGenerator(new TokenFeatureGenerator(), 2, 2):
this one means that the two previous and the two next tokens are used as features to train the model; the window size probably depends on the language and shouldn't be too big, to avoid losing generalization.

new WindowFeatureGenerator(new TokenClassFeatureGenerator(true), 2, 2):
this one is similar to the former, but it uses the class of the token instead of the token itself. Let's say I can do POS tagging on my sentences, which is a classification of their tokens. I think this may be an interesting property for detecting named entities (e.g. a Place is often introduced by a token with the POS tag IN). How can I exploit this idea?
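
For instance, would a custom generator along these lines work? A minimal sketch (the class name and the way the tags are handed in are my own guesses; I'm assuming the tags are produced beforehand, e.g. by running POSTaggerME over the same tokens):

    import java.util.List;
    import opennlp.tools.util.featuregen.FeatureGeneratorAdapter;

    // Hypothetical sketch: emit the POS tag of each token as a feature.
    // The tags array must be set for every sentence before it is processed.
    public class PosTagFeatureGenerator extends FeatureGeneratorAdapter {

        private String[] tags;

        public void setCurrentTags(String[] tags) {
            this.tags = tags;
        }

        @Override
        public void createFeatures(List<String> features, String[] tokens,
                int index, String[] previousOutcomes) {
            if (tags != null && index < tags.length) {
                features.add("pos=" + tags[index]);
            }
        }
    }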

new OutcomePriorFeatureGenerator(),
new PreviousMapFeatureGenerator(),
new BigramNameFeatureGenerator():
these FeatureGenerators aren't very clear to me, and I would like to understand them more in depth. The only thing I understand is that they aren't used by default.
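
From skimming the Javadoc, my guess is: OutcomePriorFeatureGenerator emits a constant feature, so the model can learn the prior distribution of outcomes; BigramNameFeatureGenerator emits token-bigram features; and PreviousMapFeatureGenerator is "adaptive", remembering the outcome previously assigned to the same token earlier in the document, which I suppose is why the state must be reset between documents. A sketch of the usage I have in mind (documentSentences being the tokenized sentences of one article):

    // names are found sentence by sentence; adaptive generators such as
    // PreviousMapFeatureGenerator accumulate state across sentences
    for (String[] sentence : documentSentences) {
        Span[] names = nameFinder.find(sentence);
        // ... collect the spans ...
    }
    // forget the per-document state before the next document
    nameFinder.clearAdaptiveData();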

new SentenceFeatureGenerator(true, false):
used to keep or skip the first and the last word of a sentence as features (depending on the boolean parameters given as input). What is the rationale for keeping the first word and skipping the last one? How can I decide on this setting? What are the possible customizations of this FeatureGenerator?
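
Regarding the XML feature generation Jörn mentions above: from my reading of GeneratorFactory in trunk, I guess the descriptor equivalent to the setup from the documentation would look roughly like this (the element names are my assumption from the sources, so please correct me; I left out the cache wrapper):

    <generators>
      <window prevLength="2" nextLength="2">
        <token/>
      </window>
      <window prevLength="2" nextLength="2">
        <tokenclass/>
      </window>
      <!-- if I read the factory right, definition maps to OutcomePriorFeatureGenerator -->
      <definition/>
      <prevmap/>
      <bigram/>
      <sentence begin="true" end="false"/>
    </generators>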

- Try the perceptron, it usually has a higher recall; train it with a cutoff of 0.
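
If I read TrainingParameters and the trainers correctly, this should be (the "PERCEPTRON" value is my assumption from the sources):

    TrainingParameters params = new TrainingParameters();
    params.put(TrainingParameters.ALGORITHM_PARAM, "PERCEPTRON");
    params.put(TrainingParameters.CUTOFF_PARAM, Integer.toString(0));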

- Use our built-in evaluation to test how a model performs, it can output performance numbers
   and print out misclassified samples.
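
For the evaluation, I suppose this is TokenNameFinderEvaluator; a sketch of what I plan to run (testSamples being an ObjectStream<NameSample> over held-out data):

    TokenNameFinderEvaluator evaluator =
        new TokenNameFinderEvaluator(new NameFinderME(model));
    evaluator.evaluate(testSamples);
    System.out.println(evaluator.getFMeasure());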

- Look carefully at misclassified samples, maybe there are patterns which do not really work
   with your model.

- Add training data which contains cases which should work but do not.

Hope this helps,
Jörn

Thank you for these hints, I will try each one carefully.

Regards,
    Riccardo
