Re: Training multiple models

Jörn Kottmann Tue, 25 Jan 2011 06:42:24 -0800

On 1/25/11 3:22 PM, Paul Cowan wrote:

Hi,


Thanks for your comments on the JIRA.

Should I be expecting exact results if the training data and the sample data
are exactly the same or is there just too little training data to tell at
this stage?

If you are training with a cutoff of 5 then the results might not be
identical, and even if they are, what you want is good results on "unknown"
data. That is why you need a certain amount of training data to get the
model going.
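
To illustrate why a cutoff can keep the model from reproducing its own
training data, here is a minimal sketch of the general idea: features seen
fewer than "cutoff" times are dropped before training. This is plain Java
for illustration only, not OpenNLP's actual implementation; the class name,
the method name and the feature strings are all made up.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class CutoffSketch {

    // Count how often each feature occurs across all training events and
    // keep only the features seen at least `cutoff` times.
    static Set<String> keptFeatures(List<List<String>> events, int cutoff) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> event : events) {
            for (String feature : event) {
                counts.merge(feature, 1, Integer::sum);
            }
        }
        Set<String> kept = new TreeSet<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() >= cutoff) {
                kept.add(e.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical events: each one is the list of features of a token.
        List<List<String>> events = List.of(
            List.of("w=Acme", "prev=<td>"),
            List.of("w=Widget", "prev=<td>"),
            List.of("w=Gadget", "prev=<td>"),
            List.of("w=Acme", "prev=<td>"),
            List.of("w=Foo", "prev=<td>"));

        // With a cutoff of 5 only the feature seen five times survives.
        System.out.println(keptFeatures(events, 5));
        // With a cutoff of 1 every feature is kept.
        System.out.println(keptFeatures(events, 1));
    }
}
```

With very little training data most word features fall below the cutoff, so
the model has little to go on even for text it was trained on.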

When we have natural language text we divide it into sentences to extract a
unit we can pass on to the name finder. It seems to me that it is more
difficult to get such a unit when working directly on HTML data. In your
case I think the previous map feature does not really help, so you could
pass a bigger chunk to the find method than you usually would.

Maybe even an entire page you crawl at a time. But then you need to have a
good way of tokenizing this page, because your tokenization should take the
HTML into account; having an HTML element as a token would make sense in my
eyes. But you could also try to just use the simple tokenizer and play a
little with the feature generation, e.g. increasing the window size to 5 or
even more.
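
A rough sketch of the first option, a tokenizer that emits each HTML tag as
a single token and splits the text in between on whitespace. This is only
an assumption of how it could look: the class and method names are made up,
and the regular expression does not handle comments, scripts or attribute
values that contain ">".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HtmlTokenizerSketch {

    // Either a whole tag such as <td class="name">, or a run of characters
    // that contains neither whitespace nor '<'.
    private static final Pattern TOKEN = Pattern.compile("<[^>]+>|[^<\\s]+");

    static String[] tokenize(String html) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(html);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens.toArray(new String[0]);
    }

    public static void main(String[] args) {
        // Hypothetical page fragment.
        String page = "<tr><td class=\"name\">Acme Widget</td><td>9.99</td></tr>";
        for (String token : tokenize(page)) {
            System.out.println(token);
        }
    }
}
```

The resulting String array for a whole page could then be handed to the
find method in one call instead of sentence by sentence.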

After you have this you still need to annotate training data, which might
not be that nice with our "text" format, because it would mean that you
have to place an entire page into one line.

But it should not be hard to come up with a new format; then you write a
small parser and create the NameSample object yourself.
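
As a sketch of what such a parser could look like, assume a made-up format
with one token per line followed by a tab and a label: "O" for tokens
outside a name, "B-type" for the first token of a name and "I-type" for the
following ones. The format, the class names and the labels are all
hypothetical. The code uses only plain Java; the tokens and the half-open
[start, end) spans it produces are what you would then convert into the
String array and Span objects of a NameSample.

```java
import java.util.ArrayList;
import java.util.List;

class PageSampleParserSketch {

    // Half-open token range [start, end) with a name type.
    static final class NameSpan {
        final int start;
        final int end;
        final String type;

        NameSpan(int start, int end, String type) {
            this.start = start;
            this.end = end;
            this.type = type;
        }
    }

    static final class ParsedSample {
        final List<String> tokens = new ArrayList<>();
        final List<NameSpan> names = new ArrayList<>();
    }

    // Parse the lines of one page; each line is "token<TAB>label".
    static ParsedSample parse(List<String> lines) {
        ParsedSample sample = new ParsedSample();
        int start = -1;      // index of the first token of the open name
        String type = null;  // type of the open name, if any
        for (String line : lines) {
            if (line.isEmpty()) {
                continue;
            }
            int tab = line.lastIndexOf('\t');
            if (tab < 0) {
                throw new IllegalArgumentException("missing label: " + line);
            }
            String token = line.substring(0, tab);
            String label = line.substring(tab + 1);
            boolean inside = start >= 0 && label.startsWith("I-")
                && label.substring(2).equals(type);
            if (!inside && start >= 0) {
                // The open name ends just before the current token.
                sample.names.add(new NameSpan(start, sample.tokens.size(), type));
                start = -1;
                type = null;
            }
            // A "B-" label, or an "I-" label with no matching open name,
            // starts a new name at the current token.
            if (label.startsWith("B-") || (label.startsWith("I-") && !inside)) {
                start = sample.tokens.size();
                type = label.substring(2);
            }
            sample.tokens.add(token);
        }
        if (start >= 0) {
            sample.names.add(new NameSpan(start, sample.tokens.size(), type));
        }
        return sample;
    }

    public static void main(String[] args) {
        ParsedSample s = parse(List.of(
            "<td>\tO", "Acme\tB-product", "Widget\tI-product", "</td>\tO"));
        for (NameSpan n : s.names) {
            System.out.println(n.type + " " + s.tokens.subList(n.start, n.end));
        }
    }
}
```

Because every token sits on its own line, an entire page can be annotated
without squeezing it into a single line.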

Hope that helps,
Jörn
