I have written an HTML parser which I use to tokenize an HTML document
(newline characters removed) and pass the tokens into the find method of
NameFinderME.

I am getting good results in some basic tests where the model is trained on
HTML and then run on similar sample HTML (with the
<START:organization>...<END> tags removed and different company names
substituted).

When it comes to training the model, I am calling the static train method of
NameFinderME.

I have noticed that the tokenization of the training data happens in the
read method of NameSampleDataStream which in turn calls the static parse
method of NameSample.

This method uses the WhitespaceTokenizer to tokenize.

Am I right in saying that I should be using the same tokenizer for both
training and finding?
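For what it's worth, here is a tiny self-contained illustration of why I
think the answer matters (plain Java, no OpenNLP on the classpath; the
punctuation-splitting tokenizer is just a made-up stand-in for any custom
tokenizer such as my HtmlTokenizer). Since find takes a pre-tokenized array,
a model trained over whitespace tokens sees different token boundaries, and
therefore different features and span offsets, if find-time tokenization
differs:

```java
import java.util.Arrays;

public class TokenizerMismatch {

    // Whitespace tokenization, as WhitespaceTokenizer does during training.
    public static String[] whitespaceTokenize(String text) {
        return text.trim().split("\\s+");
    }

    // A tokenizer that also splits off punctuation, standing in for any
    // custom tokenizer (e.g. an HtmlTokenizer) used at find time.
    public static String[] punctTokenize(String text) {
        return text.trim().split("\\s+|(?<=\\w)(?=[,.])|(?<=[,.])(?=\\w)");
    }

    public static void main(String[] args) {
        String text = "Acme Corp, founded 1999.";
        String[] trainTokens = whitespaceTokenize(text); // 4 tokens
        String[] findTokens = punctTokenize(text);       // 6 tokens
        System.out.println(Arrays.toString(trainTokens));
        System.out.println(Arrays.toString(findTokens));
        // The arrays differ in length, so a span learned over whitespace
        // tokens would point at the wrong tokens in the second array.
        System.out.println(trainTokens.length == findTokens.length); // prints false
    }
}
```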

Should I write something to take care of the NameSample object creation that
uses my HtmlTokenizer, or would it make sense to extend
NameSampleDataStream to allow for the use of other tokenizers?
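To make the second option concrete, here is a sketch of the parsing step
that NameSample.parse performs with WhitespaceTokenizer, rewritten to take
any tokenizer. It is plain Java with no OpenNLP dependency, so the
Tokenizer interface and Span class below are stand-ins for
opennlp.tools.tokenize.Tokenizer and opennlp.tools.util.Span, and the
sample text is made up; a real version would wrap the result in a
NameSample and serve it from an ObjectStream:

```java
import java.util.ArrayList;
import java.util.List;

public class CustomNameSampleParser {

    // Stand-in for opennlp.tools.tokenize.Tokenizer so the sketch compiles
    // without OpenNLP on the classpath; plug in a custom HtmlTokenizer here.
    public interface Tokenizer {
        String[] tokenize(String text);
    }

    // Token-offset span, mirroring opennlp.tools.util.Span (end exclusive).
    public static final class Span {
        public final int start;
        public final int end;
        public final String type;
        Span(int start, int end, String type) {
            this.start = start;
            this.end = end;
            this.type = type;
        }
    }

    // Mirrors what NameSample.parse does, but with a pluggable tokenizer:
    // tokenize the annotated line, strip the <START:...>/<END> markers,
    // collect the plain tokens, and record each covered token range as a Span.
    public static List<Span> parse(String line, Tokenizer tokenizer,
                                   List<String> outTokens) {
        List<Span> names = new ArrayList<>();
        int nameStart = -1;
        String nameType = null;
        for (String tok : tokenizer.tokenize(line)) {
            if (tok.startsWith("<START:") && tok.endsWith(">")) {
                nameType = tok.substring("<START:".length(), tok.length() - 1);
                nameStart = outTokens.size();
            } else if (tok.equals("<END>")) {
                names.add(new Span(nameStart, outTokens.size(), nameType));
                nameStart = -1;
            } else {
                outTokens.add(tok);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        Tokenizer ws = text -> text.trim().split("\\s+"); // swap in HtmlTokenizer
        List<String> tokens = new ArrayList<>();
        List<Span> names = parse(
            "<START:organization> Acme Corp <END> hired Paul", ws, tokens);
        System.out.println(tokens); // prints [Acme, Corp, hired, Paul]
        Span org = names.get(0);
        System.out.println(org.start + ".." + org.end + " " + org.type);
        // prints 0..2 organization
    }
}
```

Because the tokenizer is injected, the same HtmlTokenizer instance could
then be reused at find time, which would settle the consistency question.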

Cheers

Paul Cowan

Cutting-Edge Solutions (Scotland)

http://thesoftwaresimpleton.blogspot.com/



On 26 January 2011 04:33, Khurram <[email protected]> wrote:

> I am trying to find out what the correlation is between the amount of
> training data and the accuracy of find calls. In other words, at what
> point does adding more training data start to matter less and less before
> we run into diminishing returns...
>
> one more thing: it would be nice to see something like a Statistic object
> populated after finder.train to see how well you have trained the model.
>
> thanks,
>
> On Tue, Jan 25, 2011 at 8:41 AM, Jörn Kottmann <[email protected]> wrote:
>
> > On 1/25/11 3:22 PM, Paul Cowan wrote:
> >
> >> Hi,
> >>
> >> Thanks for your comments on the JIRA.
> >>
> >> Should I be expecting exact results if the training data and the sample
> >> data are exactly the same, or is there just too little training data to
> >> tell at this stage?
> >>
> >>
> > If you are training with a cutoff of 5 then the results might not be
> > identical, and even if they are, you want good results on "unknown" data.
> >
> > That is why you need a certain amount of training data to get the model
> > going.
> >
> > When we have natural language text we divide it into sentences to extract
> > a unit we can pass on to the name finder. It seems to me that it is more
> > difficult to get such a unit when working directly on HTML data. In your
> > case I think the previous map feature does not really help, so you could
> > pass a bigger chunk to the find method than you usually would.
> >
> > Maybe even an entire page you crawl at a time. But then you need to have
> > a good way of tokenizing this page, because your tokenization should take
> > the HTML into account; having an HTML element as a token would make sense
> > in my eyes. But you could also try to just use the simple tokenizer and
> > play a little with the feature generation, e.g. increasing the window
> > size to 5 or even more.
> >
> > After you have this you still need to annotate training data, which might
> > not be that nice with our "text" format, because it would mean that you
> > have to place an entire page into one line.
> >
> > But it should not be hard to come up with a new format; then you write a
> > small parser and create the NameSample object yourself.
> >
> > Hope that helps,
> > Jörn
> >
> >
>
