Thanks, James, for the reply. I will have to educate myself more on this topic. Did you experiment with generating training data?
Vyacheslav

On Sep 14, 2011, at 3:47 AM, James Kosin wrote:

> On 9/13/2011 7:10 AM, Vyacheslav Zholudev wrote:
>> Hi,
>>
>> I'm a bit new to OpenNLP, and I'm interested in the name finder
>> functionality.
>> The embedded organization model works relatively well for me, but not
>> well enough, so I decided to train my own model. However, I can't achieve
>> stable results. I would appreciate it if anybody could answer a couple of
>> questions:
>>
>> 1) What are the characteristics of a good training data set? I have a
>> training data generator that injects many different organizations into a
>> set of predefined sentences.
> It depends. The current model is trained on news feeds and tends to give
> the best results on that kind of data. The feature generators should
> abstract the organization name away from the sentence and look only at
> the surrounding words during training. That may or may not be the
> case... I'm not an expert yet.
> Randomly injecting names into sentences of a predefined style is also a
> bad idea, even though it may sound like a good way to generate large data
> sets. It will probably lead to a name finder that memorizes the word
> patterns rather than learning how to find the names. If you already have
> the predefined sentences, you could just build a filter on the words in
> those sentences to extract the names without the name finder's help; in
> theory, that is what you would end up with anyway after training on
> predefined sentences with injected names.
> The idea behind the model is to abstract the presence of a name without
> actually memorizing anything.
>>
>> 2) I guess I need to implement adaptive feature generators? Is there some
>> good documentation on how to do so? Maybe even some books? A description
>> of how the name finder works would definitely be useful.
> Yes, I would guess so too. I also need to implement something myself for
> the names dictionary I've been able to create but not yet use.
>>
>> 3) Based on what characteristics should I choose the number of iterations
>> and the cutoff?
> This depends on the data and the accuracy you are aiming for. Usually the
> defaults of a cutoff of 5 and 100 iterations are good starting points. A
> cutoff of 5 means that a pattern seen in the training sentences isn't
> counted as training evidence until it has appeared in 5 or more
> sentences. This filters out spurious bad data that may just be a one-time
> fluke or wrongly edited text. Humans are usually the source of all the
> data.
>>
>> 4) Can I train a model for several languages at a time?
> This won't work well; the languages all use different rules and patterns
> for names. Different character sets are used as well, and all of this
> plays a role in training the model. It is usually best to train a
> separate model for each language.
>>
>> Any other suggestions/pointers are highly appreciated.
> Try the simple approach first. The name finder works best when the data
> is already split into sentences and tokenized, so the sentence detector
> and tokenizer are usually trained before the name finder model. The
> biggest problem I've found is finding a large set of freely available
> data that can be used and possibly distributed. Our group is currently
> working on a project for that now. You are more than welcome to join us;
> see the wiki pages.
>
>>
>> Thanks a lot in advance,
>> Vyacheslav
>>
> James
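
In case it is useful to anyone reading this thread later, below is a minimal sketch of what setting the iterations and cutoff looks like with the 1.5 training API. The file names, the "organization" type, and the parameter values are placeholders rather than anything taken from this thread, and the code roughly follows the name finder training example in the OpenNLP manual; treat it as a starting point, not a tested recipe.

import java.io.BufferedOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.Charset;
import java.util.Collections;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.NameSampleDataStream;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class TrainOrgModel {

    public static void main(String[] args) throws IOException {
        // Training data: one sentence per line, names marked with
        // <START:organization> ... <END> (placeholder file name).
        Charset charset = Charset.forName("UTF-8");
        ObjectStream<String> lineStream =
            new PlainTextByLineStream(new FileInputStream("en-ner-org.train"), charset);
        ObjectStream<NameSample> sampleStream = new NameSampleDataStream(lineStream);

        // The defaults are 100 iterations and a cutoff of 5; raise or lower
        // them and re-evaluate on held-out data rather than guessing.
        TrainingParameters params = TrainingParameters.defaultParams();
        params.put(TrainingParameters.ITERATIONS_PARAM, Integer.toString(100));
        params.put(TrainingParameters.CUTOFF_PARAM, Integer.toString(5));

        TokenNameFinderModel model;
        try {
            // Passing null uses the default feature generators; a custom
            // AdaptiveFeatureGenerator would be plugged in here instead.
            model = NameFinderME.train("en", "organization", sampleStream, params,
                null, Collections.<String, Object>emptyMap());
        } finally {
            sampleStream.close();
        }

        OutputStream modelOut =
            new BufferedOutputStream(new FileOutputStream("en-ner-org.bin"));
        try {
            model.serialize(modelOut);
        } finally {
            modelOut.close();
        }
    }
}

Training a separate model per language, as suggested above, would just mean repeating this with a different training file and language code for each language. The TokenNameFinderTrainer command-line tool does the same job if you would rather not write code, though the exact option names depend on the OpenNLP version.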