NameFinderME detecting organisations in an HTML sample with limited training
----------------------------------------------------------------------------

                 Key: OPENNLP-67
                 URL: https://issues.apache.org/jira/browse/OPENNLP-67
             Project: OpenNLP
          Issue Type: Question
          Components: Name Finder
    Affects Versions: tools-1.5.0-sourceforge
            Reporter: Paul


I have attached a patch named htmltest.patch.  

The patch contains a test named NameFinderMEHtmlTest and two embedded
resources, html1.train and html.html.  html1.train is the training sample: an
HTML document marked up with <START:organization> Org <END> tags.  html.html
is the same HTML document without the training markup.  The HTML has been
preprocessed to remove all line break characters.
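
For reference, here is a minimal sketch of how one line in this annotated
format breaks down into tokens and name spans.  It assumes the usual
conventions of the format: one sentence per line, tokens separated by
whitespace, and the <START:type> and <END> tags standing as separate
whitespace-delimited items.  The class AnnotatedLineSketch and the sample
organisation name are made up for illustration; they are not part of the
patch or of OpenNLP.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustration only: splits one annotated training line into tokens and spans. */
public class AnnotatedLineSketch {

    /** Tokens of one line plus the name spans marked up in it. */
    public static final class Parsed {
        public final List<String> tokens = new ArrayList<>();
        // Each span is {startTokenIndex, endTokenIndexExclusive}.
        public final List<int[]> spans = new ArrayList<>();
        // Entity type of the span at the same list position.
        public final List<String> types = new ArrayList<>();
    }

    public static Parsed parse(String line) {
        Parsed result = new Parsed();
        int start = -1;      // token index where the open span began, -1 if none
        String type = null;  // entity type of the open span
        for (String part : line.trim().split("\\s+")) {
            if (part.startsWith("<START:") && part.endsWith(">")) {
                // Opening tag: remember where the name starts and its type.
                start = result.tokens.size();
                type = part.substring("<START:".length(), part.length() - 1);
            } else if (part.equals("<END>") && start >= 0) {
                // Closing tag: the name covers the tokens added since the opening tag.
                result.spans.add(new int[] {start, result.tokens.size()});
                result.types.add(type);
                start = -1;
            } else if (!part.isEmpty()) {
                // Anything else, including HTML tags, is an ordinary token.
                result.tokens.add(part);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical sample line; "Acme Widgets Ltd" is an invented name.
        Parsed p = parse("<li> <START:organization> Acme Widgets Ltd <END> </li>");
        System.out.println("tokens: " + p.tokens);
        for (int i = 0; i < p.spans.size(); i++) {
            int[] s = p.spans.get(i);
            System.out.println(p.types.get(i) + ": " + p.tokens.subList(s[0], s[1]));
        }
    }
}
```

Under these assumptions, HTML tags that touch a name without intervening
whitespace (for example <b>Acme</b>) end up glued to it in a single token, and
a document with every line break removed is read as one very long sentence.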

In NameFinderMEHtmlTest I train a model on the training sample and then use
find to retrieve the names.

Was I wrong to assume that NameFinderME would find the exact names in the
HTML?  NameFinderMEHtmlTest fails because it finds only part of the first
name rather than the whole name.  Is this because of the limited training
data, or because the find method performs badly on HTML documents?

I am new to OpenNLP, so there is an element of guesswork as to which streams
etc. I should be using.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.