[
https://issues.apache.org/jira/browse/OPENNLP-67?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Paul updated OPENNLP-67:
------------------------
Attachment: html.patch
I have simplified the HTML significantly, but I am still not getting the
required results. The training data and the test data are exactly the same.
Should I expect exact results when the training data and the test data are
identical, or is there simply too little data to tell at this stage?
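
When checking whether the results are "exact", it may help to separate names that are found exactly from names that are only partly found or missed entirely. The following is a minimal sketch, not part of the attached patch. The class SpanMatchSketch and its methods are hypothetical names, and spans are assumed to be token ranges with an inclusive start and an exclusive end, which is my understanding of how the spans returned by find are defined.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for comparing expected name spans with found spans.
// Each span is an int[] {start, end}: start inclusive, end exclusive (assumed).
final class SpanMatchSketch {

    enum Match { EXACT, PARTIAL, MISSED }

    /** Classifies one expected span against all found spans. */
    static Match classify(int[] expected, List<int[]> found) {
        Match best = Match.MISSED;
        for (int[] f : found) {
            if (f[0] == expected[0] && f[1] == expected[1]) {
                return Match.EXACT;
            }
            // Half-open ranges overlap when each starts before the other ends.
            if (f[0] < expected[1] && expected[0] < f[1]) {
                best = Match.PARTIAL;
            }
        }
        return best;
    }

    /** Classifies every expected span, in order. */
    static List<Match> classifyAll(List<int[]> expected, List<int[]> found) {
        List<Match> result = new ArrayList<>();
        for (int[] e : expected) {
            result.add(classify(e, found));
        }
        return result;
    }
}
```

A test could then report, for each expected organisation, whether it was matched exactly, partially, or not at all, instead of failing on the first mismatch.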
> NameFinderMe detecting organisations in an HTML sample with limited training
> ----------------------------------------------------------------------------
>
> Key: OPENNLP-67
> URL: https://issues.apache.org/jira/browse/OPENNLP-67
> Project: OpenNLP
> Issue Type: Question
> Components: Name Finder
> Affects Versions: tools-1.5.0-sourceforge
> Reporter: Paul
> Attachments: html.patch, htmltest.patch
>
>
> I have attached a patch named htmltest.patch.
> The patch contains a test named NameFinderMEHtmlTest and two embedded
> resources named html1.train and html.html. html1.train is the training
> sample: an HTML document marked up with <START:organization> Org <END>
> tags. html.html is the same HTML document without the training markup.
> The HTML has been preprocessed to remove all line-break characters.
> In NameFinderMEHtmlTest I train a model on the data and then use find to
> retrieve the names.
> Was I wrong to assume that NameFinderME would find the exact names in the
> HTML? I say exact because the training HTML and the test HTML are the
> same. NameFinderMEHtmlTest fails because it does not find the first name
> in full; it finds only part of it. Is this because of the limited training
> data, or does the find method perform badly on HTML documents?
> I am new to OpenNLP, so there is an element of guesswork as to which
> streams, etc., I should be using.
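
To make the training markup described above concrete, here is a minimal sketch of how a line marked up with <START:organization> Org <END> tags could be read into tokens and name spans. It is not the OpenNLP implementation. The class NameSampleSketch and its nested types are hypothetical, and the sketch assumes that tokens and tags are separated by whitespace and that each line is one sample, which is my understanding of the name finder training format.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical reader for one line of <START:type> ... <END> markup.
final class NameSampleSketch {

    /** A name covering tokens [start, end) with the given type. */
    static final class NameSpan {
        final int start;
        final int end;
        final String type;

        NameSpan(int start, int end, String type) {
            this.start = start;
            this.end = end;
            this.type = type;
        }
    }

    /** The tokens of a line with the markup removed, plus the marked names. */
    static final class Parsed {
        final List<String> tokens = new ArrayList<>();
        final List<NameSpan> spans = new ArrayList<>();
    }

    static Parsed parse(String line) {
        Parsed parsed = new Parsed();
        int openAt = -1;        // token index where the open name began
        String openType = null; // type of the open name, e.g. "organization"
        // Assumption: tokens and tags are separated by whitespace.
        for (String part : line.trim().split("\\s+")) {
            if (part.startsWith("<START:") && part.endsWith(">")) {
                openAt = parsed.tokens.size();
                openType = part.substring("<START:".length(), part.length() - 1);
            } else if (part.equals("<END>")) {
                if (openAt < 0) {
                    throw new IllegalArgumentException("<END> without <START:...>");
                }
                parsed.spans.add(new NameSpan(openAt, parsed.tokens.size(), openType));
                openAt = -1;
            } else if (!part.isEmpty()) {
                parsed.tokens.add(part);
            }
        }
        if (openAt >= 0) {
            throw new IllegalArgumentException("<START:...> without <END>");
        }
        return parsed;
    }
}
```

Under these assumptions, a tag that touches HTML markup or text without a space in between (for example <b><START:organization>Acme<END></b>) is read as a single ordinary token and yields no name at all. Likewise, a document with every line break removed is read as one very long sample. Either could contribute to partial matches, though that is a guess and not something the patch confirms.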