Thanks, James, for the reply. I will have to educate myself more on this topic. Did you experiment with generating training data?
Vyacheslav

On Sep 14, 2011, at 3:47 AM, James Kosin wrote:

> On 9/13/2011 7:10 AM, Vyacheslav Zholudev wrote:
>> Hi,
>>
>> I'm a bit new to OpenNLP, and I'm interested in the name finder
>> functionality.
>> The embedded organization model works relatively well for me, but not
>> well enough, so I decided to train my own model. However, I can't achieve
>> stable results. I would appreciate it if anybody could answer a couple of
>> questions:
>>
>> 1) What are the characteristics of a good training data set? I have a
>> training data generator that injects many different organizations into a
>> set of predefined sentences.
> It depends. The current model is trained on news feeds and tends to give
> the best results on that kind of data. The feature generators should
> abstract the organization name away from the sentence and look only at
> the surrounding words during training. That may or may not be the
> case... I'm not an expert yet.
> Randomly injecting names into sentences of a predefined style is also a
> bad idea, even though it may sound like a good way to generate large data
> sets. It will probably lead to a name finder that memorizes the word
> patterns rather than learning how to find the names. If you already have
> the predefined sentences, you could just build a filter on the words in
> those sentences to extract the names without the name finder's help; in
> theory, that is what you would end up with anyway after training on
> predefined sentences with injected names.
> The idea behind the model is to abstract the presence of a name without
> actually memorizing anything.
>>
>> 2) I guess I need to implement adaptive feature generators? Is there some
>> good documentation on how to do so? Maybe even some books? A description
>> of how the name finder works would definitely be useful.
> Yes, I would guess so too. I also need to implement something myself for
> the names dictionary I've been able to create but not yet use.
>>
>> 3) Based on what characteristics should I choose the number of iterations
>> and the cutoff?
> This depends on the data and the accuracy you are aiming for. Usually the
> defaults of a cutoff of 5 and 100 iterations are good starting points. A
> cutoff of 5 means that a pattern seen in the training sentences isn't
> counted as training evidence until it has appeared in 5 or more
> sentences. This filters out spurious bad data that may just be a one-time
> fluke or wrongly edited text. Humans are usually the source of all the
> data.
>>
>> 4) Can I train a model for several languages at a time?
> This won't work well; the languages all use different rules and patterns
> for names. Different character sets are used as well, and all of this
> plays a role in training the model. It is usually best to train a
> separate model for each language.
>>
>> Any other suggestions/pointers are highly appreciated.
> Try the simple approach first. The name finder works best when the data
> is already split into sentences and tokenized, so the sentence detector
> and tokenizer are usually trained before the name finder model. The
> biggest problem I've found is finding a large set of freely available
> data that can be used and possibly distributed. Our group is currently
> working on a project for that now. You are more than welcome to join us;
> see the wiki pages.
>
>>
>> Thanks a lot in advance,
>> Vyacheslav
>>
> James
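
In case it is useful to anyone reading this thread later, below is a minimal sketch of what setting the iterations and cutoff looks like with the 1.5 training API. The file names, the "organization" type, and the parameter values are placeholders rather than anything taken from this thread, and the code roughly follows the name finder training example in the OpenNLP manual; treat it as a starting point, not a tested recipe.

import java.io.BufferedOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.Charset;
import java.util.Collections;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.NameSampleDataStream;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class TrainOrgModel {

    public static void main(String[] args) throws IOException {
        // Training data: one sentence per line, names marked with
        // <START:organization> ... <END> (placeholder file name).
        Charset charset = Charset.forName("UTF-8");
        ObjectStream<String> lineStream =
            new PlainTextByLineStream(new FileInputStream("en-ner-org.train"), charset);
        ObjectStream<NameSample> sampleStream = new NameSampleDataStream(lineStream);

        // The defaults are 100 iterations and a cutoff of 5; raise or lower
        // them and re-evaluate on held-out data rather than guessing.
        TrainingParameters params = TrainingParameters.defaultParams();
        params.put(TrainingParameters.ITERATIONS_PARAM, Integer.toString(100));
        params.put(TrainingParameters.CUTOFF_PARAM, Integer.toString(5));

        TokenNameFinderModel model;
        try {
            // Passing null uses the default feature generators; a custom
            // AdaptiveFeatureGenerator would be plugged in here instead.
            model = NameFinderME.train("en", "organization", sampleStream, params,
                null, Collections.<String, Object>emptyMap());
        } finally {
            sampleStream.close();
        }

        OutputStream modelOut =
            new BufferedOutputStream(new FileOutputStream("en-ner-org.bin"));
        try {
            model.serialize(modelOut);
        } finally {
            modelOut.close();
        }
    }
}

Training a separate model per language, as suggested above, would just mean repeating this with a different training file and language code for each language. The TokenNameFinderTrainer command-line tool does the same job if you would rather not write code, though the exact option names depend on the OpenNLP version.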