NameFinderME detecting organisations in an HTML sample with limited training
----------------------------------------------------------------------------
Key: OPENNLP-67
URL: https://issues.apache.org/jira/browse/OPENNLP-67
Project: OpenNLP
Issue Type: Question
Components: Name Finder
Affects Versions: tools-1.5.0-sourceforge
Reporter: Paul
I have attached a patch named htmltest.patch.
The patch contains a test named NameFinderMEHtmlTest and two embedded
resources named html1.train and html.html. html1.train is the training sample:
an HTML document marked up with <START:organization> Org <END> tags.
html.html is the same HTML document without the training markup. The HTML has
been preprocessed to remove all line-break characters.
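To make the markup concrete, here is a minimal sketch of how one annotated line maps to tokens and name spans. It assumes the usual name finder training convention: whitespace-separated tokens, with the <START:type> and <END> tags standing as separate whitespace-delimited tokens. The class and method names (NameSampleSketch, parse, NameSpan) are hypothetical and are not part of OpenNLP; the library has its own reader for this format, and this sketch only illustrates the idea.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is NOT OpenNLP's own parser.
final class NameSampleSketch {

    /** A half-open token range [start, end) together with its entity type. */
    static final class NameSpan {
        final int start;
        final int end;
        final String type;

        NameSpan(int start, int end, String type) {
            this.start = start;
            this.end = end;
            this.type = type;
        }

        @Override
        public String toString() {
            return type + "[" + start + ".." + end + ")";
        }
    }

    /** Tokens of one line with the markup removed, plus the annotated spans. */
    static final class Parsed {
        final List<String> tokens = new ArrayList<>();
        final List<NameSpan> spans = new ArrayList<>();
    }

    /** Splits on whitespace and turns <START:type> ... <END> pairs into spans. */
    static Parsed parse(String line) {
        Parsed out = new Parsed();
        String openType = null; // type of the span currently open, if any
        int openStart = -1;     // token index at which that span begins
        for (String tok : line.trim().split("\\s+")) {
            if (tok.isEmpty()) {
                continue; // happens only for a blank line
            }
            if (tok.startsWith("<START:") && tok.endsWith(">")) {
                if (openType != null) {
                    throw new IllegalArgumentException("nested <START:...> tag");
                }
                openType = tok.substring("<START:".length(), tok.length() - 1);
                openStart = out.tokens.size();
            } else if (tok.equals("<END>")) {
                if (openType == null) {
                    throw new IllegalArgumentException("<END> without <START:...>");
                }
                out.spans.add(new NameSpan(openStart, out.tokens.size(), openType));
                openType = null;
            } else {
                out.tokens.add(tok); // HTML tags count as ordinary tokens here
            }
        }
        if (openType != null) {
            throw new IllegalArgumentException("missing <END> tag");
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up sample line; five tokens remain, with one span over tokens 1..3.
        Parsed p = parse("<li> <START:organization> Acme Widgets Ltd <END> </li>");
        System.out.println(p.tokens + " " + p.spans);
    }
}
```

Note that in this sketch a tag written without surrounding spaces, such as <li>Acme, ends up as a single token, so how the HTML is split into tokens directly affects which spans can be expressed.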
In NameFinderMEHtmlTest I train a model from the training sample and then use
find to retrieve the names.
Was I wrong to assume that NameFinderME would find the exact names in the
HTML? NameFinderMEHtmlTest fails because it does not find the first name in
full; it finds only part of it. Is this because of the limited training data,
or does the find method perform badly on HTML documents?
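When diagnosing a failure like this, it can help to distinguish a partial hit from a complete miss instead of relying on a single equality assertion. The following is a small sketch of that comparison, assuming names are represented as half-open token ranges [start, end). The names SpanMatchSketch and compare are hypothetical and are not taken from the attached test or from OpenNLP.

```java
// Hypothetical helper for a test; not part of OpenNLP or the attached patch.
final class SpanMatchSketch {

    enum Match { EXACT, PARTIAL, MISS }

    /** Compares an expected and a found half-open token range [start, end). */
    static Match compare(int expStart, int expEnd, int foundStart, int foundEnd) {
        if (expStart == foundStart && expEnd == foundEnd) {
            return Match.EXACT;
        }
        // Two half-open ranges overlap when each one starts before the other ends.
        boolean overlap = foundStart < expEnd && expStart < foundEnd;
        return overlap ? Match.PARTIAL : Match.MISS;
    }

    public static void main(String[] args) {
        System.out.println(compare(1, 4, 1, 4)); // same boundaries
        System.out.println(compare(1, 4, 2, 4)); // only part of the name found
        System.out.println(compare(1, 4, 5, 7)); // no overlap at all
    }
}
```

A PARTIAL result corresponds to the situation described above, where only part of the expected name is found.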
I am new to OpenNLP, so there is an element of guesswork as to which streams
etc. I should be using.