On 1/25/11 3:22 PM, Paul Cowan wrote:
Hi,

Thanks for your comments on the JIRA.

Should I be expecting exact results if the training data and the sample data
are exactly the same or is there just too little training data to tell at
this stage?


If you are training with a cutoff of 5, the results might not be identical,
and even if they are, what you really want is good results on "unknown" data.

That is why you need a certain amount of training data to get the model going.
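
To illustrate why a cutoff can change results even on the training data itself, here is a minimal sketch. It assumes the cutoff works as a minimum occurrence count, so that features seen fewer times than the cutoff are dropped before training. The class name and the feature strings are made up for illustration and are not OpenNLP code.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration, not part of OpenNLP.
class CutoffSketch {

    // Returns the features that occur at least `cutoff` times in the observed list.
    static Set<String> keptFeatures(List<String> observed, int cutoff) {
        Map<String, Integer> counts = new HashMap<>();
        for (String feature : observed) {
            counts.merge(feature, 1, Integer::sum);
        }
        Set<String> kept = new TreeSet<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() >= cutoff) {
                kept.add(entry.getKey());
            }
        }
        return kept;
    }
}
```

With very little training data, most features fall below the cutoff, so the model cannot use them even for text it was trained on.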

When we have natural language text, we divide it into sentences to get a unit
we can pass on to the name finder. It seems to me that it is more difficult to
get such a unit when working directly on html data. In your case I think the
previous map feature does not really help, so you could pass a bigger chunk to
the find method than you usually would.

Maybe even an entire page you crawl at a time. But then you need a good way of
tokenizing this page, because your tokenization should take the html into
account; having an html element as a token would make sense in my eyes. You
could also try just using the simple tokenizer and play a little with the
feature generation, e.g. increasing the window size to 5 or even more.
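
As a rough sketch of what "an html element as a token" could look like, the following stand-alone class emits each html tag as one token and splits the remaining text into words and single punctuation characters. The class name and the regular expression are assumptions for illustration; this is not an OpenNLP tokenizer and it does not handle comments, scripts or entities.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of OpenNLP.
class HtmlAwareTokenizer {

    // A token is a whole html tag, a run of letters/digits,
    // or any other single non-whitespace character.
    private static final Pattern TOKEN =
        Pattern.compile("<[^<>]+>|[\\p{L}\\p{N}]+|[^\\s]");

    static List<String> tokenize(String html) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(html);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Each tag comes out as a single token next to the word tokens.
        System.out.println(tokenize("<p>Contact <b>Paul Cowan</b>, please.</p>"));
    }
}
```

The resulting list can be converted to a String array, which is the kind of input the name finder's find method works on.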

After you have this you still need to annotate training data, which might not
be that nice with our "text" format, because it would mean that you have to
place an entire page into one line.

But it should not be hard to come up with a new format; then you write a small
parser and create the NameSample object yourself.
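
One possible shape for such a format and parser is sketched below. The format is invented for this example: one token per line written as token, tab, tag, where the tag is either O (outside a name) or a name type such as person, and an empty line ends a page. The classes PageSampleParser, PageSample and NameSpan are hypothetical stand-ins; in real code the token and span lists would be turned into the arrays that OpenNLP's NameSample and Span expect, after checking the constructors of the version in use.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical format and classes; nothing here is part of OpenNLP.
class PageSampleParser {

    // A name covering tokens [start, end) with a type such as "person".
    static final class NameSpan {
        final int start;
        final int end;
        final String type;

        NameSpan(int start, int end, String type) {
            this.start = start;
            this.end = end;
            this.type = type;
        }
    }

    // All tokens of one page plus the names annotated on them.
    static final class PageSample {
        final List<String> tokens = new ArrayList<>();
        final List<NameSpan> names = new ArrayList<>();
    }

    // Parses "token<TAB>tag" lines; an empty line ends the current page.
    // Note: two directly adjacent names of the same type are merged into one span.
    static List<PageSample> parse(List<String> lines) {
        List<PageSample> samples = new ArrayList<>();
        PageSample current = new PageSample();
        String openType = null; // type of the name currently being read, if any
        int openStart = -1;     // token index where that name started

        for (String line : lines) {
            if (line.trim().isEmpty()) {
                if (openType != null) {
                    current.names.add(new NameSpan(openStart, current.tokens.size(), openType));
                    openType = null;
                }
                if (!current.tokens.isEmpty()) {
                    samples.add(current);
                    current = new PageSample();
                }
                continue;
            }
            int tab = line.lastIndexOf('\t');
            if (tab < 0) {
                throw new IllegalArgumentException("Expected token<TAB>tag: " + line);
            }
            String token = line.substring(0, tab);
            String tag = line.substring(tab + 1).trim();
            String type = tag.equals("O") ? null : tag;

            // Close the open name when the tag changes.
            if (openType != null && !openType.equals(type)) {
                current.names.add(new NameSpan(openStart, current.tokens.size(), openType));
                openType = null;
            }
            // Open a new name at the current token index.
            if (type != null && openType == null) {
                openType = type;
                openStart = current.tokens.size();
            }
            current.tokens.add(token);
        }
        // Flush the last page if the input did not end with an empty line.
        if (openType != null) {
            current.names.add(new NameSpan(openStart, current.tokens.size(), openType));
        }
        if (!current.tokens.isEmpty()) {
            samples.add(current);
        }
        return samples;
    }
}
```

Keeping one token per line avoids having to squeeze a whole page into a single line, and html tags can be annotated with O like any other token.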

Hope that helps,
Jörn
