On 1/25/11 3:22 PM, Paul Cowan wrote:
> Hi,
> Thanks for your comments on the JIRA.
> Should I be expecting exact results if the training data and the sample data
> are exactly the same, or is there just too little training data to tell at
> this stage?

If you are training with a cutoff of 5, the results might not be identical,
and even if they are, what you want is good results on "unknown" data.
That is why you need a certain amount of training data to get the model
going.
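
To illustrate why the cutoff alone can change the results on the training
data itself, here is a minimal sketch. It assumes the cutoff simply discards
every feature seen fewer than N times during training; the class name and the
feature strings are made up for illustration, and this is not the actual
trainer code.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class CutoffSketch {

    // Keep only the features that occur at least `cutoff` times.
    // Assumption: this mirrors what a training cutoff does conceptually.
    public static Set<String> applyCutoff(List<String> observedFeatures, int cutoff) {
        Map<String, Integer> counts = new HashMap<>();
        for (String feature : observedFeatures) {
            counts.merge(feature, 1, Integer::sum);
        }
        Set<String> kept = new TreeSet<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() >= cutoff) {
                kept.add(entry.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // "w=Paul" is seen five times, "w=Cowan" only twice.
        List<String> features = Arrays.asList(
                "w=Paul", "w=Paul", "w=Paul", "w=Paul", "w=Paul",
                "w=Cowan", "w=Cowan");
        // With a cutoff of 5 the rare feature never reaches the model.
        System.out.println(applyCutoff(features, 5)); // prints [w=Paul]
    }
}
```

With very little training data most features fall below the cutoff, so the
model cannot even memorize the samples it was trained on.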
When we have natural language text, we divide it into sentences to extract
a unit we can pass on to the name finder. It seems to me that it is more
difficult to get such a unit when working directly on HTML data. In your
case I think the previous map feature does not really help, so you could
pass a bigger chunk to the find method than you usually would.
Maybe even an entire page you crawl at a time. But then you need a good way
of tokenizing this page, because your tokenization should take the HTML into
account; having an HTML element as a token would make sense in my eyes. But
you could also try to just use the simple tokenizer and play a little with
the feature generation, e.g. increasing the window size to 5 or even more.
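
A rough sketch of what "an HTML element as a token" could look like, using
only java.util.regex. The class name and the regular expression are my own
assumptions, it does no real HTML parsing (comments, scripts and attributes
containing ">" are not handled), and it is not an existing tokenizer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HtmlAwareTokenizer {

    // Alternatives are tried in order:
    //   1. a whole tag such as <td class="x"> or </b>
    //   2. a run of word characters
    //   3. any other single non-whitespace character (punctuation, stray '<')
    private static final Pattern TOKEN = Pattern.compile(
            "<[^<>]+>|\\w+|[^\\s\\w]",
            Pattern.UNICODE_CHARACTER_CLASS);

    public static String[] tokenize(String page) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(page);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String[] tokens = tokenize("<p>Hi <b>Bob</b>!</p>");
        // prints <p> | Hi | <b> | Bob | </b> | ! | </p>
        System.out.println(String.join(" | ", tokens));
    }
}
```

The resulting array for a whole page is the kind of bigger chunk that could
be passed to the find method in one call.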
After you have this, you still need to annotate training data, which might
not be that nice with our "text" format, because it would mean that you have
to place an entire page on one line. But it should not be hard to come up
with a new format; then you write a small parser and create the NameSample
object yourself.
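
As an example of such a format, here is a small sketch of a parser for a
purely hypothetical layout: one token per line, followed by a tab and either
an entity type or "O" for tokens outside a name. The format, the class names
and the NameSpan holder are all my invention. The token list and the
half-open token ranges it produces are what you would then turn into a
NameSample; please check the constructor in your version for the exact
arguments.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PageSampleParser {

    /** Half-open token range [start, end) with an entity type (hypothetical holder). */
    public static final class NameSpan {
        public final int start;
        public final int end;
        public final String type;

        NameSpan(int start, int end, String type) {
            this.start = start;
            this.end = end;
            this.type = type;
        }
    }

    /** All tokens of one page plus the annotated names. */
    public static final class ParsedPage {
        public final List<String> tokens = new ArrayList<>();
        public final List<NameSpan> names = new ArrayList<>();
    }

    // Each line is "token<TAB>label"; a missing or empty label counts as "O".
    // Consecutive tokens with the same label are merged into one span, so two
    // directly adjacent names of the same type cannot be told apart.
    public static ParsedPage parse(List<String> lines) {
        ParsedPage page = new ParsedPage();
        int spanStart = -1;
        String spanType = null;
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue;
            }
            String[] parts = line.split("\t", 2);
            String label = parts.length > 1 ? parts[1].trim() : "O";
            if (label.isEmpty()) {
                label = "O";
            }
            // Close the open span as soon as the label changes.
            if (spanType != null && !spanType.equals(label)) {
                page.names.add(new NameSpan(spanStart, page.tokens.size(), spanType));
                spanType = null;
            }
            // Open a new span on the first token carrying an entity label.
            if (spanType == null && !label.equals("O")) {
                spanStart = page.tokens.size();
                spanType = label;
            }
            page.tokens.add(parts[0]);
        }
        if (spanType != null) {
            page.names.add(new NameSpan(spanStart, page.tokens.size(), spanType));
        }
        return page;
    }

    public static void main(String[] args) {
        ParsedPage page = parse(Arrays.asList(
                "<td>\tO", "Paul\tperson", "Cowan\tperson", "</td>\tO"));
        NameSpan name = page.names.get(0);
        // prints person [1, 3)
        System.out.println(name.type + " [" + name.start + ", " + name.end + ")");
    }
}
```

Since a page spans many lines in this layout, you also need some separator
between pages in the training file, which the sketch leaves out.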
Hope that helps,
Jörn