[ 
https://issues.apache.org/jira/browse/OPENNLP-67?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul updated OPENNLP-67:
------------------------

    Description: 
I have attached a patch named htmltest.patch.  

The patch contains a test named NameFinderMEHtmlTest and two embedded resources 
named html1.train and html.html.  html1.train is the training sample: an HTML 
document marked up with <START:organization> Org <END> tags.  html.html is the 
same HTML document without the training markup.  The HTML has been preprocessed 
so that all line break characters are removed. 
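
As a rough illustration of the markup described above, the following Java 
sketch splits one annotated line into plain tokens and the name spans that the 
<START:organization> ... <END> tags describe.  It is not OpenNLP's own parser: 
the names NameSampleSketch, Parsed, NameSpan and parse are made up for this 
example, and it assumes that tokens and tags are separated by whitespace only.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: splits one annotated training line into tokens and name spans. */
final class NameSampleSketch {

    /** A name covering tokens [start, end) with a type such as "organization". */
    static final class NameSpan {
        final int start;
        final int end;
        final String type;

        NameSpan(int start, int end, String type) {
            this.start = start;
            this.end = end;
            this.type = type;
        }
    }

    /** Tokens with the markup removed, plus the spans the markup described. */
    static final class Parsed {
        final List<String> tokens = new ArrayList<>();
        final List<NameSpan> spans = new ArrayList<>();
    }

    private static final String START_PREFIX = "<START:";
    private static final String END_TAG = "<END>";

    static Parsed parse(String line) {
        Parsed parsed = new Parsed();
        int openStart = -1;      // token index where the currently open name began
        String openType = null;  // type of the currently open name, or null if none

        // Assumption: tokens and tags are separated by whitespace only.
        for (String part : line.trim().split("\\s+")) {
            if (part.isEmpty()) {
                continue; // happens only for an empty input line
            }
            if (part.startsWith(START_PREFIX) && part.endsWith(">")) {
                // "<START:organization>" opens a name of type "organization".
                openType = part.substring(START_PREFIX.length(), part.length() - 1);
                openStart = parsed.tokens.size();
            } else if (part.equals(END_TAG) && openType != null) {
                // "<END>" closes the open name; the end index is exclusive.
                parsed.spans.add(new NameSpan(openStart, parsed.tokens.size(), openType));
                openType = null;
            } else {
                // Anything else, including HTML tags, is kept as an ordinary token.
                parsed.tokens.add(part);
            }
        }
        return parsed;
    }
}
```

Under these assumptions, a line such as 
"<p> <START:organization> Acme Widgets Ltd <END> </p>" yields five tokens and 
one organization span over tokens 1 to 3.  If an HTML tag is glued to the 
markup, as in "<p><START:organization> Acme <END></p>", this sketch sees no 
span at all, so it may be worth checking how the training and test HTML are 
split into tokens.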

In NameFinderMEHtmlTest I train a model on this data and then use find to 
retrieve the names. 

Was I wrong to assume that NameFinderME would find the exact names in the 
HTML?  I say exact because the training HTML and the test HTML are the same 
document.  NameFinderMEHtmlTest fails because it does not find the first name 
in full; it finds only part of it.  Is this because the training data is 
limited, or because the find method performs poorly against HTML documents?

I am new to OpenNLP, so there is an element of guesswork as to which streams 
etc. I should be using.


> NameFinderME detecting organisations in an HTML sample with limited training
> ----------------------------------------------------------------------------
>
>                 Key: OPENNLP-67
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-67
>             Project: OpenNLP
>          Issue Type: Question
>          Components: Name Finder
>    Affects Versions: tools-1.5.0-sourceforge
>            Reporter: Paul
>         Attachments: htmltest.patch
>

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
