William,
I found a problem with the longest-match logic: I wasn't jumping over the
match after adding it to the spans. This could cause "George Bauer" and
"Bauer" to be returned as two entries if "Bauer" was also in the dictionary.
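
To make the fix concrete, here is a minimal sketch of a longest-match scan that jumps over each match once it has been added to the spans. It is not the actual DictionaryNameFinder code: the class name LongestMatchSketch, the find() signature, the maxLength parameter, and the use of a Set of token lists as the dictionary are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for the name finder; not OpenNLP code.
final class LongestMatchSketch {

    /** Returns {start, end} pairs (end exclusive) of the longest dictionary matches. */
    static List<int[]> find(String[] tokens, Set<List<String>> dictionary, int maxLength) {
        List<int[]> spans = new ArrayList<>();
        List<String> all = Arrays.asList(tokens);
        int start = 0;
        while (start < tokens.length) {
            int matchedEnd = -1;
            // Try the longest candidate first so "George Bauer" wins over a shorter entry.
            int longest = Math.min(maxLength, tokens.length - start);
            for (int len = longest; len >= 1; len--) {
                if (dictionary.contains(all.subList(start, start + len))) {
                    matchedEnd = start + len;
                    break;
                }
            }
            if (matchedEnd > 0) {
                spans.add(new int[] {start, matchedEnd});
                // Jump over the match. Advancing by one token instead would
                // report "Bauer" a second time as its own span.
                start = matchedEnd;
            } else {
                start++;
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        Set<List<String>> dictionary = new HashSet<>();
        dictionary.add(Arrays.asList("George", "Bauer"));
        dictionary.add(Arrays.asList("Bauer"));
        String[] tokens = {"Luise", "and", "George", "Bauer", "."};
        for (int[] span : find(tokens, dictionary, 2)) {
            System.out.println(span[0] + ".." + span[1]); // prints 2..4
        }
    }
}
```

With both "George Bauer" and "Bauer" in the dictionary, only the span covering tokens 2 and 3 comes back, because the scan resumes after the match rather than inside it.
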
I'm a little confused by the output I'm getting now:
-----
Running opennlp.tools.namefind.DictionaryNameFinderEvaluatorTest
Expected: {
Since then, our guests have to ring at Veilchenstraße 11 if they want to
visit us, <START:default> Luise <END> and George Bauer <END>.}
Predicted: {
Since then, our guests have to ring at Veilchenstraße 11 if they want to
visit us, <START:default> Luise <END> and <START:default> George <END>
Bauer <END>.}
False positives: {
[George]
} False negatives: {
[]
}
----
That is what I get after adding the line
assertTrue(fmeasure.getFmeasure() == 1); to the test file...?
On 3/19/2012 10:11 PM, William Colen (Commented) (JIRA) wrote:
> [
> https://issues.apache.org/jira/browse/OPENNLP-471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13233124#comment-13233124
> ]
>
> William Colen commented on OPENNLP-471:
> ---------------------------------------
>
> Oops... sorry! I am debugging the code and might have found the real reason
> for the previous output. I will investigate it further.
>
>> DictionaryNameFinder has HASHing issues
>> ---------------------------------------
>>
>> Key: OPENNLP-471
>> URL: https://issues.apache.org/jira/browse/OPENNLP-471
>> Project: OpenNLP
>> Issue Type: Bug
>> Components: Name Finder
>> Reporter: James Kosin
>> Assignee: James Kosin
>> Labels: dictionary, namefinder
>> Fix For: tools-1.5.3
>>
>>
>> The DictionaryNameFinder has issues finding multi-token names when the
>> dictionary is searched a token at a time by the find() method and the
>> dictionary doesn't have a single-token (or shorter) match available.
>> Having a dictionary with {"folic", "acid"} without an entry for {"folic"}
>> will cause the find() method to totally skip the fact there is a longer
>> match possible.
>> Thanks to Jim for pushing this and to my debugging skills for finding it.
>> Three possibilities come to mind:
>> 1) One I don't really like: we turn it into a larger problem by trying
>> longer matches when shorter ones don't match. Unfortunately, this quickly
>> turns into a race to see who can wait longer.
>> 2) A way of returning a possible match that may need exploring, or a
>> look-ahead type system to say we don't match "folic" but if you have "acid"
>> after "folic" we have a match for that in the dictionary.
>> 3) Leave it as is and modify the dictionary to add shorter terms to the
>> dictionary... maybe marking as not-a-valid entry so we can know we need a
>> longer match.
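
Options 2 and 3 above can be combined in a rough sketch: alongside the real entries, keep a separate set of their proper prefixes, so the lookup knows that "folic" is not a name on its own but is worth extending. This is only an illustration under assumed names (PrefixLookaheadSketch, add(), longestMatchEnd()); none of it is taken from the OpenNLP Dictionary API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical dictionary wrapper; not OpenNLP code.
final class PrefixLookaheadSketch {
    private final Set<List<String>> entries = new HashSet<>();
    // Proper prefixes of entries: "not a valid entry, but keep looking".
    private final Set<List<String>> prefixes = new HashSet<>();

    void add(String... tokens) {
        List<String> entry = new ArrayList<>(Arrays.asList(tokens));
        entries.add(entry);
        for (int len = 1; len < entry.size(); len++) {
            prefixes.add(new ArrayList<>(entry.subList(0, len)));
        }
    }

    /** Returns the end index (exclusive) of the longest entry starting at start, or -1. */
    int longestMatchEnd(String[] tokens, int start) {
        List<String> all = Arrays.asList(tokens);
        int best = -1;
        for (int end = start + 1; end <= tokens.length; end++) {
            List<String> candidate = all.subList(start, end);
            if (entries.contains(candidate)) {
                best = end; // remember the longest full match seen so far
            }
            if (!prefixes.contains(candidate)) {
                break; // no longer entry begins with these tokens, so stop extending
            }
        }
        return best;
    }
}
```

With only {"folic", "acid"} added, longestMatchEnd() on the tokens "take folic acid daily" starting at index 1 returns 3, while a lone "folic" returns -1. The scan stops as soon as the candidate is not a prefix of any entry, which avoids the blind "try ever longer matches" cost of option 1.
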