[ 
https://issues.apache.org/jira/browse/OPENNLP-67?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12984030#action_12984030
 ] 

Paul commented on OPENNLP-67:
-----------------------------

Thank you for your detailed feedback, Jörn.  I agree that the training data 
file is too large, and I will trim it down before resubmitting the patch.
I will also get a better understanding of both tokenization and feature 
generation first.

One thing I am unsure about is how to break up an HTML file for the name 
finder.  Sentence detection using a model like en-sent.bin will obviously not 
work on HTML.  Would I need to train my own model, or should I do this 
programmatically?

Could you recommend a strategy for breaking up the HTML?
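
To make the programmatic option concrete, here is a rough sketch of what I 
have in mind: split the document at block-level tags, strip the remaining 
inline tags, and hand each text segment to the tokenizer and name finder as if 
it were a sentence.  The class name HtmlSegmenter and the list of boundary 
tags are my own assumptions, not part of OpenNLP, and regular expressions are 
only a stand-in for a real HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper (not an OpenNLP class): cuts HTML into plain-text
// segments at block-level tags so each segment can be treated as a sentence.
public class HtmlSegmenter {

    // Tags treated as segment boundaries. This list is an assumption;
    // extend or shrink it to match the markup being processed.
    private static final Pattern BLOCK_TAG = Pattern.compile(
            "(?i)</?(p|div|br|li|ul|ol|tr|td|th|table|h[1-6])\\b[^>]*>");

    // Any remaining tag is treated as inline markup and replaced by a space.
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]+>");

    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    public static List<String> segments(String html) {
        List<String> result = new ArrayList<>();
        for (String chunk : BLOCK_TAG.split(html)) {
            // Drop inline tags, then collapse runs of whitespace.
            String text = ANY_TAG.matcher(chunk).replaceAll(" ");
            text = WHITESPACE.matcher(text).replaceAll(" ").trim();
            if (!text.isEmpty()) {
                result.add(text);
            }
        }
        return result;
    }
}
```

One caveat with this sketch: replacing inline tags with a space will split a 
word that has markup in the middle of it, and entities such as &amp; are left 
undecoded.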

> NameFinderMe detecting organisations in an HTML sample with limited training
> ----------------------------------------------------------------------------
>
>                 Key: OPENNLP-67
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-67
>             Project: OpenNLP
>          Issue Type: Question
>          Components: Name Finder
>    Affects Versions: tools-1.5.0-sourceforge
>            Reporter: Paul
>         Attachments: htmltest.patch
>
>
> I have attached a patch named htmltest.patch.  
> The patch contains a test named NameFinderMEHtmlTest and 2 embedded resources 
> named html1.train and html.html.  Obviously html1.train is the training 
> sample which is a sample HTML document marked up with <START:organization> 
> Org <END> tags.  html.html is the same HTML document without the training 
> mark up.  The HTML has been preprocessed with all the line break characters 
> removed. 
> In the NameFinderMEHtmlTest I am training the data and then using find to 
> retrieve the names. 
> Was my assumption wrong in thinking that NameFinderME would find the exact 
> names from the html?  I mean exact in this context because both the training 
> html and the test html are the same.  The NameFinderMEHtmlTest fails because 
> it does not find the first name, although it does find part of the name.  
> Is this because it has limited training, or is the find method performing 
> badly against an HTML document?
> I am new to OpenNLP, so there is an element of guesswork as to which streams 
> etc. I should be using.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
