[ 
https://issues.apache.org/jira/browse/OPENNLP-67?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul updated OPENNLP-67:
------------------------

    Attachment: html.patch

I have simplified the HTML significantly, but I am still not getting the 
required results.  The training data and the test data are exactly the same. 

Should I expect exact matches when the training data and the test data 
are identical, or is there just too little data to tell at this 
stage?

> NameFinderME detecting organisations in an HTML sample with limited training
> ----------------------------------------------------------------------------
>
>                 Key: OPENNLP-67
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-67
>             Project: OpenNLP
>          Issue Type: Question
>          Components: Name Finder
>    Affects Versions: tools-1.5.0-sourceforge
>            Reporter: Paul
>         Attachments: html.patch, htmltest.patch
>
>
> I have attached a patch named htmltest.patch.  
> The patch contains a test named NameFinderMEHtmlTest and 2 embedded resources 
> named html1.train and html.html.  html1.train is the training 
> sample: an HTML document marked up with <START:organization> 
> Org <END> tags.  html.html is the same HTML document without the training 
> markup.  The HTML has been preprocessed to remove all line break 
> characters. 
> In NameFinderMEHtmlTest I train on that data and then use find to 
> retrieve the names. 
> Was my assumption wrong in thinking that NameFinderME would find the exact 
> names in the HTML?  I mean exact in this context because the training 
> HTML and the test HTML are the same.  NameFinderMEHtmlTest fails because 
> it does not find the first name in full; it finds only part of the name.  Is 
> this because of the limited training data, or does the find method perform 
> badly against HTML documents?
> I am new to OpenNLP, so there is an element of guesswork as to which streams 
> etc. I should be using.
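
For reference, here is a minimal sketch of how a line marked up with 
<START:organization> ... <END> tags can be read as whitespace-separated 
tokens plus token spans.  It is an illustration only and does not use the 
OpenNLP classes: NameSampleSketch, TokenSpan, ParsedSample and parse are 
hypothetical names, and the sample line ("Acme Widgets Ltd") is made up 
rather than taken from the attached patch.  The sketch assumes the tags are 
separated from the surrounding text by whitespace.

```java
import java.util.ArrayList;
import java.util.List;

public class NameSampleSketch {

    /** Half-open token range [start, end) with a type. Hypothetical stand-in, not OpenNLP's Span. */
    public static final class TokenSpan {
        public final int start;
        public final int end;
        public final String type;

        TokenSpan(int start, int end, String type) {
            this.start = start;
            this.end = end;
            this.type = type;
        }

        @Override
        public String toString() {
            return type + "[" + start + ".." + end + ")";
        }
    }

    /** Tokens of one line with the annotation tags stripped, plus the annotated spans. */
    public static final class ParsedSample {
        public final List<String> tokens = new ArrayList<>();
        public final List<TokenSpan> spans = new ArrayList<>();
    }

    private static final String START_PREFIX = "<START:";
    private static final String END_TAG = "<END>";

    /** Splits on whitespace; a tag is recognised only when it is a whole token on its own. */
    public static ParsedSample parse(String annotatedLine) {
        ParsedSample sample = new ParsedSample();
        String trimmed = annotatedLine.trim();
        if (trimmed.isEmpty()) {
            return sample;
        }
        String openType = null; // type of the span currently open, if any
        int openStart = -1;     // index of the first token inside the open span
        for (String part : trimmed.split("\\s+")) {
            if (part.startsWith(START_PREFIX) && part.endsWith(">")) {
                if (openType != null) {
                    throw new IllegalArgumentException("Nested start tag: " + part);
                }
                openType = part.substring(START_PREFIX.length(), part.length() - 1);
                openStart = sample.tokens.size();
            } else if (part.equals(END_TAG)) {
                if (openType == null) {
                    throw new IllegalArgumentException("End tag without a start tag");
                }
                sample.spans.add(new TokenSpan(openStart, sample.tokens.size(), openType));
                openType = null;
            } else {
                sample.tokens.add(part); // ordinary token, HTML tags included
            }
        }
        if (openType != null) {
            throw new IllegalArgumentException("Start tag was never closed");
        }
        return sample;
    }

    public static void main(String[] args) {
        ParsedSample s = parse(
            "<p> Contact <START:organization> Acme Widgets Ltd <END> today </p>");
        System.out.println(s.tokens);
        System.out.println(s.spans);
    }
}
```

In this sketch a tag glued to HTML markup, such as <b><START:organization>, 
is treated as an ordinary token and no span is recorded.  Whether the real 
parser behaves the same way is an assumption worth checking against the 
training file.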

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
