[ 
https://issues.apache.org/jira/browse/OPENNLP-471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13232973#comment-13232973
 ] 

James Kosin commented on OPENNLP-471:
-------------------------------------

I think maybe William got a little confused.

What should be happening is that the longest matching span is returned. The 
F1 score, however, is based on successful and unsuccessful matches against 
what should be new data (test data), separate from the training data, which 
is probably why the score is low. Recall measures how many of the expected 
names were actually found, and 100% sounds right, since all the data in the 
dictionary should be found.
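
To illustrate how perfect recall can coexist with a low F1, here is a minimal 
sketch assuming the standard definitions of precision, recall and F1. The 
class name F1Sketch and the counts in main() are hypothetical and are not 
taken from the actual evaluation run.

```java
import java.util.Locale;

public class F1Sketch {

    // Precision: fraction of the spans we reported that were correct.
    public static double precision(int truePositives, int falsePositives) {
        int reported = truePositives + falsePositives;
        return reported == 0 ? 0.0 : (double) truePositives / reported;
    }

    // Recall: fraction of the expected spans that we actually reported.
    public static double recall(int truePositives, int falseNegatives) {
        int expected = truePositives + falseNegatives;
        return expected == 0 ? 0.0 : (double) truePositives / expected;
    }

    // F1: harmonic mean of precision and recall.
    public static double f1(double precision, double recall) {
        double sum = precision + recall;
        return sum == 0.0 ? 0.0 : 2.0 * precision * recall / sum;
    }

    public static void main(String[] args) {
        // Hypothetical counts: every expected name is found (no misses),
        // but many reported spans are not in the reference data.
        int tp = 10, fp = 30, fn = 0;
        double p = precision(tp, fp);
        double r = recall(tp, fn);
        System.out.printf(Locale.ROOT, "precision=%.2f recall=%.2f F1=%.2f%n",
                p, r, f1(p, r));
    }
}
```

With counts like these, recall stays at 100% while the false positives pull 
precision, and therefore F1, down.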

I'll have to verify this.

James
                
> DictionaryNameFinder has HASHing issues
> ---------------------------------------
>
>                 Key: OPENNLP-471
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-471
>             Project: OpenNLP
>          Issue Type: Bug
>          Components: Name Finder
>            Reporter: James Kosin
>            Assignee: James Kosin
>              Labels: dictionary, namefinder
>             Fix For: tools-1.5.3
>
>
> The DictionaryNameFinder has issues finding multi-token names because the 
> find() method searches the dictionary a token at a time.  The problem 
> shows up when the dictionary has no single-token (or shorter) match for 
> the start of a longer entry.
> For example, a dictionary with {"folic", "acid"} but no entry for {"folic"} 
> will cause the find() method to completely miss the fact that a longer 
> match is possible.
> Thanks to Jim for pushing on this, and to my debugging skills for finding it.
> Three possibilities come to mind:
> 1)  One I don't really like: we turn it into a larger problem by trying 
> longer matches when shorter ones don't match.  Unfortunately, this quickly 
> turns into a race to see who can wait longer.
> 2)  A way of returning a possible match that may need exploring, or a 
> look-ahead type system that says we don't match "folic", but if "acid" 
> comes after "folic" we have a match for that in the dictionary.
> 3)  Leave it as is and modify the dictionary to add the shorter terms, 
> maybe marking them as not-a-valid entry so we know we need a longer match.
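
As a rough illustration of the first possibility in the quoted description 
(trying longer matches when shorter ones do not match), the search can be 
bounded by the length of the longest dictionary entry so it does not grow 
without limit. This is only a sketch under that assumption: the class 
LongestMatchSketch and its methods are hypothetical, use plain Java 
collections, and are not the OpenNLP DictionaryNameFinder or Dictionary API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class LongestMatchSketch {

    // Hypothetical stand-in for the dictionary: each entry is a token sequence.
    private final Set<List<String>> entries = new HashSet<>();
    // Length of the longest entry, used to bound the look-ahead.
    private int maxEntryLength = 0;

    public void add(String... tokens) {
        entries.add(new ArrayList<>(Arrays.asList(tokens)));
        maxEntryLength = Math.max(maxEntryLength, tokens.length);
    }

    // Returns {start, end} index pairs (end exclusive) of the longest
    // dictionary match beginning at each position, scanning left to right.
    public List<int[]> find(String[] tokens) {
        List<String> tokenList = Arrays.asList(tokens);
        List<int[]> spans = new ArrayList<>();
        int start = 0;
        while (start < tokens.length) {
            int longest = Math.min(maxEntryLength, tokens.length - start);
            int matchedEnd = -1;
            // Try the longest candidate first, so {"folic", "acid"} is found
            // even when {"folic"} alone is not in the dictionary.
            for (int len = longest; len >= 1; len--) {
                if (entries.contains(tokenList.subList(start, start + len))) {
                    matchedEnd = start + len;
                    break;
                }
            }
            if (matchedEnd != -1) {
                spans.add(new int[] {start, matchedEnd});
                start = matchedEnd; // continue after the matched span
            } else {
                start++;
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        LongestMatchSketch finder = new LongestMatchSketch();
        finder.add("folic", "acid");
        for (int[] span : finder.find(new String[] {"take", "folic", "acid", "daily"})) {
            System.out.println("match at [" + span[0] + ", " + span[1] + ")");
        }
    }
}
```

The cost per starting token is bounded by the longest entry length, which 
trades some extra lookups for not needing placeholder entries such as 
{"folic"} in the dictionary.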


        
