On 9/13/2011 7:10 AM, Vyacheslav Zholudev wrote:
> Hi,
>
> I'm a bit new to OpenNLP, and I'm interested in the name finder functionality.
> The embedded organization model works relatively well for me, but not
> sufficiently well. So I decided to go for training. However, I can't achieve
> stable results. I would appreciate it if anybody could answer a couple of
> questions:
>
> 1) What are the characteristics of a good training data set? I have a
> training data generator that injects many different organizations into some
> set of predefined sentences.

It depends. The current models are trained on news feeds and tend to give the best results on that type of data. The feature generators should abstract the organization name away from the sentence and look only at the surrounding words during training. This may or may not be the case... I'm not an expert yet.

Randomly injecting names into a predefined set of sentences is also a bad idea, although it may sound attractive for generating large data sets. It will probably lead to a name finder that memorizes the word patterns rather than learning how to properly find the names. If you already have the predefined sentences, you could just build a filter on the words in the sentences to extract the names without the name finder's help. In theory, that is what you would end up with if you trained on the predefined sentences with the injected names. With the model, the idea is to abstract the presence of the name without actually memorizing anything.

> 2) I guess I need to implement adaptive feature generators? Is there some
> good documentation on how to do so? Maybe even some books? A description of
> how the name finder works would definitely be useful.

Yes, I suspect that will be the case. I also need to implement something myself for the name(s) dictionary I've been able to create but not yet use.

> 3) Based on what characteristics should I choose the number of iterations
> and the cutoff?
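To make the filter idea above concrete, here is a minimal sketch in Java. The annotated line uses OpenNLP's name finder training format (`<START:organization> ... <END>`); the `TemplateNameFilter` class and the sample sentences are hypothetical, just illustrating that fixed sentence templates let you recover injected names with plain pattern matching, no statistical model required.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: if the sentence templates are fixed, the injected
// names can be pulled back out with a regular expression. The sample line
// below uses OpenNLP's name finder training annotation format.
public class TemplateNameFilter {

    private static final Pattern ANNOTATION =
            Pattern.compile("<START:organization>\\s*(.+?)\\s*<END>");

    // Extract every annotated organization span from one training line.
    public static List<String> extractOrganizations(String line) {
        List<String> names = new ArrayList<>();
        Matcher m = ANNOTATION.matcher(line);
        while (m.find()) {
            names.add(m.group(1));
        }
        return names;
    }

    public static void main(String[] args) {
        String sample = "Shares of <START:organization> Acme Corp <END> rose "
                + "after <START:organization> Example Labs <END> announced a merger .";
        System.out.println(extractOrganizations(sample));
        // prints: [Acme Corp, Example Labs]
    }
}
```

This is exactly why injected templates make poor training data: a trivial filter already solves the problem the model would be "learning".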
This depends on the data and the accuracy you are trying for. Usually the defaults of 5 and 100 are good starting points. A cutoff of 5 means any sentences that share a pattern aren't counted until 5 or more sentences are found with that same pattern; then the pattern is counted as training data. This filters out spurious bad data that may just be a one-time fluke or wrongly edited text. Humans are usually the source of all the data.

> 4) Can I train a model for several languages at a time?

This won't work well; the languages all use different rules and patterns for names. Even different character sets are used, and all of this plays a factor in the training of the model. It is usually best to train a separate model for each language.

> Any other suggestions/pointers are highly appreciated.

Try the simple approach first. The name finder works best when the data is already sentence-split and tokenized, so training the sentence detector and tokenizer usually comes before training the name finder model. The biggest problem I've found is finding a large data set of freely available data that can be used and possibly distributed. Our group is currently working on a project for that now. You are more than welcome to join us on that. See the wiki pages....
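The cutoff behavior described above can be sketched as a simple frequency filter. This is an illustrative sketch only, not OpenNLP's actual implementation; the class name, the string keys, and the counts are invented for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch (not OpenNLP's internal code): a cutoff drops any
// observed pattern whose count across the corpus is below the threshold,
// so one-off flukes and typos never contribute training events.
public class CutoffFilter {

    public static Map<String, Integer> applyCutoff(Map<String, Integer> counts,
                                                   int cutoff) {
        Map<String, Integer> kept = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() >= cutoff) {
                kept.put(e.getKey(), e.getValue());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("prev=said,cur=Acme", 12); // seen often: kept
        counts.put("prev=teh,cur=Acme", 1);   // one-time typo: dropped
        System.out.println(applyCutoff(counts, 5).keySet());
        // prints: [prev=said,cur=Acme]
    }
}
```

Raising the cutoff trades recall of rare-but-real patterns for robustness against noise, which is why 5 is a starting point to tune rather than a fixed rule.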
> Thanks a lot in advance,
> Vyacheslav

James