Re: [jira] [Commented] (OPENNLP-471) DictionaryNameFinder has HASHing issues

James Kosin Mon, 19 Mar 2012 19:42:27 -0700

William,

I found a problem with the longest match: I wasn't jumping over the
match after adding it to the spans.  This could cause "George Bauer" and
"Bauer" to be returned as two entries if "Bauer" was also in the dictionary.
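
To illustrate the fix described above, here is a minimal sketch of a
longest-match scan that jumps over a match once it has been recorded.  It is
not the actual DictionaryNameFinder code: the class name, the find() signature,
the plain Set<List<String>> dictionary and the maxLen parameter are all
assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a dictionary-based name finder; not OpenNLP's API.
class LongestMatchSketch {

    /** Returns {start, end} token index pairs, with end exclusive. */
    static List<int[]> find(String[] tokens, Set<List<String>> dictionary, int maxLen) {
        List<int[]> spans = new ArrayList<>();
        int start = 0;
        while (start < tokens.length) {
            int matchEnd = -1;
            int limit = Math.min(tokens.length, start + maxLen);
            // Try the longest candidate first and shrink until something matches.
            for (int end = limit; end > start; end--) {
                List<String> candidate = Arrays.asList(tokens).subList(start, end);
                if (dictionary.contains(candidate)) {
                    matchEnd = end;
                    break;
                }
            }
            if (matchEnd != -1) {
                spans.add(new int[] {start, matchEnd});
                // Jump over the whole match so its inner tokens
                // (e.g. "Bauer") are not reported a second time.
                start = matchEnd;
            } else {
                start++;
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        Set<List<String>> dict = new HashSet<>();
        dict.add(Arrays.asList("George", "Bauer"));
        dict.add(Arrays.asList("Bauer"));
        dict.add(Arrays.asList("Luise"));
        String[] tokens = {"visit", "us", ",", "Luise", "and", "George", "Bauer", "."};
        for (int[] span : find(tokens, dict, 2)) {
            System.out.println(span[0] + ".." + span[1]);
        }
    }
}
```

Replacing "start = matchEnd" with "start++" reproduces the behaviour described
above: "Bauer" is then reported again as a separate span.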


I'm a little confused by the output I'm getting now:
-----
Running opennlp.tools.namefind.DictionaryNameFinderEvaluatorTest
Expected: {
Since then, our guests have to ring at Veilchenstraße 11 if they want to
visit us, <START:default> Luise <END> and George Bauer <END>.}
Predicted: {
Since then, our guests have to ring at Veilchenstraße 11 if they want to
visit us, <START:default> Luise <END> and <START:default> George <END>
Bauer <END>.}
False positives: {
[George]
} False negatives: {
[]
}
----
This is after adding the line assertTrue(fmeasure.getFmeasure() == 1);
to the test file...?




On 3/19/2012 10:11 PM, William Colen (Commented) (JIRA) wrote:
>     [ 
> https://issues.apache.org/jira/browse/OPENNLP-471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13233124#comment-13233124
>  ] 
>
> William Colen commented on OPENNLP-471:
> ---------------------------------------
>
> Oops... sorry! I am debugging the code and might have found the real reason 
> for the previous output. I will investigate it further.
>                 
>> DictionaryNameFinder has HASHing issues
>> ---------------------------------------
>>
>>                 Key: OPENNLP-471
>>                 URL: https://issues.apache.org/jira/browse/OPENNLP-471
>>             Project: OpenNLP
>>          Issue Type: Bug
>>          Components: Name Finder
>>            Reporter: James Kosin
>>            Assignee: James Kosin
>>              Labels: dictionary, namefinder
>>             Fix For: tools-1.5.3
>>
>>
>> The DictionaryNameFinder has issues finding multi-token names because the 
>> find() method searches the dictionary a token at a time.  The problem shows 
>> up when the dictionary doesn't have a single-token (or shorter) match for 
>> the start of a longer entry.
>> Having a dictionary with {"folic", "acid"} without an entry for {"folic"} 
>> will cause the find() method to totally skip the fact there is a longer 
>> match possible.
>> Thanks to Jim for pushing this and to my debugging skills for finding it.
>> Three possibilities come to mind:
>> 1)  One I don't really like: we turn it into a larger problem by trying 
>> longer matches when shorter ones don't match.  Unfortunately, this turns 
>> quickly into a race to see who can wait longer.
>> 2)  A way of returning a possible match that may need exploring, or a 
>> look-ahead type system to say we don't match "folic" but if you have "acid" 
>> after "folic" we have a match for that in the dictionary.
>> 3)  Leave it as is and modify the dictionary to add shorter terms to the 
>> dictionary... maybe marking as not-a-valid entry so we can know we need a 
>> longer match.
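
Option 2 in the quoted issue (a look-ahead that knows "folic" can start a
longer entry even though "folic" alone is not one) could be sketched roughly as
follows.  This is only an illustration under assumptions: the class, its
methods and the separate prefix set are hypothetical and are not part of the
OpenNLP Dictionary API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical dictionary that also remembers every proper prefix of its entries.
class PrefixLookaheadSketch {
    private final Set<List<String>> entries = new HashSet<>();
    private final Set<List<String>> prefixes = new HashSet<>();

    void add(String... tokens) {
        List<String> entry = Arrays.asList(tokens);
        entries.add(entry);
        // Record "folic" as a prefix of {"folic", "acid"} without making it an entry.
        for (int n = 1; n < entry.size(); n++) {
            prefixes.add(new ArrayList<>(entry.subList(0, n)));
        }
    }

    /** Returns the exclusive end index of the longest entry starting at start, or -1. */
    int longestMatchEnd(String[] tokens, int start) {
        int best = -1;
        for (int end = start + 1; end <= tokens.length; end++) {
            List<String> candidate = Arrays.asList(tokens).subList(start, end);
            if (entries.contains(candidate)) {
                best = end;
            }
            // Stop growing as soon as no longer entry can start with this candidate.
            if (!prefixes.contains(candidate)) {
                break;
            }
        }
        return best;
    }
}
```

With only {"folic", "acid"} added, scanning from "folic" keeps going because
"folic" is a known prefix, and the two-token entry is found.  The prefix set is
close in spirit to option 3, except that the "not-a-valid entry" markers are
kept apart from the real entries.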
