On 9/13/2011 7:10 AM, Vyacheslav Zholudev wrote:
> Hi,
>
> I'm a bit new to OpenNLP, and I'm interested in the name finder functionality.
> The embedded organization model works relatively well for me, but not
> sufficiently well. So I decided to go for training. However, I can't achieve
> stable results. I would appreciate it if anybody could answer a couple of
> questions:
>
> 1) What are the characteristics of a good training data set? I have a
> training data generator that injects many different organizations into some
> set of predefined sentences.

It depends. The current models are trained on news feeds and tend to give the best results on that type of data. The feature generators should abstract the organization name away from the sentence and look only at the surrounding words during training. This may or may not be the case... I'm not an expert yet.

Randomly injecting names into a predefined set of sentences is also a bad idea, although it may sound attractive for generating large data sets. It will probably lead to a name finder that memorizes the word patterns rather than learning how to properly find the names. If you already have the predefined sentences, you could just build a filter on the words in the sentences to extract the names without the name finder's help. In theory, that is what you would end up with if you trained on the predefined sentences with the injected names. With the model, the idea is to abstract the presence of the name without actually memorizing anything.

> 2) I guess I need to implement adaptive feature generators? Is there some
> good documentation on how to do so? Maybe even some books? A description of
> how the name finder works would definitely be useful.

Yes, I suspect that will be the case. I also need to implement something myself for the name(s) dictionary I've been able to create but not yet use.

> 3) Based on what characteristics should I choose the number of iterations
> and the cutoff?
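To make the filter idea above concrete, here is a minimal sketch in Java. The annotated line uses OpenNLP's name finder training format (`<START:organization> ... <END>`); the `TemplateNameFilter` class and the sample sentences are hypothetical, just illustrating that fixed sentence templates let you recover injected names with plain pattern matching, no statistical model required.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: if the sentence templates are fixed, the injected
// names can be pulled back out with a regular expression. The sample line
// below uses OpenNLP's name finder training annotation format.
public class TemplateNameFilter {

    private static final Pattern ANNOTATION =
            Pattern.compile("<START:organization>\\s*(.+?)\\s*<END>");

    // Extract every annotated organization span from one training line.
    public static List<String> extractOrganizations(String line) {
        List<String> names = new ArrayList<>();
        Matcher m = ANNOTATION.matcher(line);
        while (m.find()) {
            names.add(m.group(1));
        }
        return names;
    }

    public static void main(String[] args) {
        String sample = "Shares of <START:organization> Acme Corp <END> rose "
                + "after <START:organization> Example Labs <END> announced a merger .";
        System.out.println(extractOrganizations(sample));
        // prints: [Acme Corp, Example Labs]
    }
}
```

This is exactly why injected templates make poor training data: a trivial filter already solves the problem the model would be "learning".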
This depends on the data and the accuracy you are trying for. Usually the defaults of 5 and 100 are good starting points. A cutoff of 5 means any sentences that share a pattern aren't counted until 5 or more sentences are found with that same pattern; then the pattern is counted as training data. This filters out spurious bad data that may just be a one-time fluke or wrongly edited text. Humans are usually the source of all the data.

> 4) Can I train a model for several languages at a time?

This won't work well; the languages all use different rules and patterns for names. Even different character sets are used, and all of this plays a factor in the training of the model. It is usually best to train a separate model for each language.

> Any other suggestions/pointers are highly appreciated.

Try the simple approach first. The name finder works best when the data is already sentence-split and tokenized, so training the sentence detector and tokenizer usually comes before training the name finder model. The biggest problem I've found is finding a large data set of freely available data that can be used and possibly distributed. Our group is currently working on a project for that now. You are more than welcome to join us on that. See the wiki pages....
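The cutoff behavior described above can be sketched as a simple frequency filter. This is an illustrative sketch only, not OpenNLP's actual implementation; the class name, the string keys, and the counts are invented for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch (not OpenNLP's internal code): a cutoff drops any
// observed pattern whose count across the corpus is below the threshold,
// so one-off flukes and typos never contribute training events.
public class CutoffFilter {

    public static Map<String, Integer> applyCutoff(Map<String, Integer> counts,
                                                   int cutoff) {
        Map<String, Integer> kept = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() >= cutoff) {
                kept.put(e.getKey(), e.getValue());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("prev=said,cur=Acme", 12); // seen often: kept
        counts.put("prev=teh,cur=Acme", 1);   // one-time typo: dropped
        System.out.println(applyCutoff(counts, 5).keySet());
        // prints: [prev=said,cur=Acme]
    }
}
```

Raising the cutoff trades recall of rare-but-real patterns for robustness against noise, which is why 5 is a starting point to tune rather than a fixed rule.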
> Thanks a lot in advance,
> Vyacheslav

James