Thanks. We have some collateral damage: the NameSampleDataStreamTest is now failing. I'm trying to fix all of this locally as well. And yes, your idea looks better than the getEnd() method I was using and checked in, so I may change it.
James

On 3/19/2012 10:51 PM, [email protected] wrote:
> Looks like we were checking it at the same time.
>
> Yes, because of this log I got confused. Actually this error is caused by
> an issue in the corpus. The sentence is not correctly annotated:
>
> ... to visit us, <START> Luise <END> and <START> George Bauer <END>.
>
> The <END> was not caught because of the '.' and the corpus parser got
> confused.
>
>
> On Mon, Mar 19, 2012 at 11:41 PM, James Kosin <[email protected]> wrote:
>
>> William,
>>
>> I found a problem with the longest match... I wasn't jumping over the
>> match after adding it to the spans. This could cause George Bauer and
>> Bauer to return as two entries if Bauer was in the dictionary.
>>
>> I'm a little confused by the output I'm getting now:
>> -----
>> Running opennlp.tools.namefind.DictionaryNameFinderEvaluatorTest
>> Expected: {
>> Since then, our guests have to ring at Veilchenstraße 11 if they want to
>> visit us, <START:default> Luise <END> and George Bauer <END>.}
>> Predicted: {
>> Since then, our guests have to ring at Veilchenstraße 11 if they want to
>> visit us, <START:default> Luise <END> and <START:default> George <END>
>> Bauer <END>.}
>> False positives: {
>> [George]
>> }
>> False negatives: {
>> []
>> }
>> ----
>> after adding a line assertTrue(fmeasure.getFmeasure() == 1); to the test
>> file...?
>>
>>
>> On 3/19/2012 10:11 PM, William Colen (Commented) (JIRA) wrote:
>>> [ https://issues.apache.org/jira/browse/OPENNLP-471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13233124#comment-13233124 ]
>>>
>>> William Colen commented on OPENNLP-471:
>>> ---------------------------------------
>>>
>>> Oops... sorry! I am debugging the code and might have found the real
>>> reason for the previous output. I will investigate it further.
>>>
>>>> DictionaryNameFinder has HASHing issues
>>>> ---------------------------------------
>>>>
>>>> Key: OPENNLP-471
>>>> URL: https://issues.apache.org/jira/browse/OPENNLP-471
>>>> Project: OpenNLP
>>>> Issue Type: Bug
>>>> Components: Name Finder
>>>> Reporter: James Kosin
>>>> Assignee: James Kosin
>>>> Labels: dictionary, namefinder
>>>> Fix For: tools-1.5.3
>>>>
>>>>
>>>> The DictionaryNameFinder has issues finding multi-token names when the
>>>> dictionary is searched a token at a time by the find() method, if the
>>>> dictionary doesn't have a single-token (or shorter) match available.
>>>> Having a dictionary with {"folic", "acid"} without an entry for
>>>> {"folic"} will cause the find() method to totally skip the fact that
>>>> there is a longer match possible.
>>>> Thanks to Jim for pushing this and to my debugging skills for finding it.
>>>> Three possibilities come to mind:
>>>> 1) One I don't really like: we turn it into a larger problem by trying
>>>> longer matches when shorter ones don't match. Unfortunately, this
>>>> quickly turns into a race to see who can wait longer.
>>>> 2) A way of returning a possible match that may need exploring, or a
>>>> look-ahead type system to say we don't match "folic" but if you have
>>>> "acid" after "folic" we have a match for that in the dictionary.
>>>> 3) Leave it as is and modify the dictionary to add the shorter terms,
>>>> maybe marking them as not-a-valid entry so we know we need a longer
>>>> match.
>>> --
>>> This message is automatically generated by JIRA.
>>> If you think it was sent incorrectly, please contact your JIRA
>>> administrators:
>>> https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
>>> For more information on JIRA, see:
>>> http://www.atlassian.com/software/jira
>>>
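
For reference, the two behaviours discussed in this thread (finding {"folic", "acid"} without a {"folic"} entry, and jumping over a match so that "Bauer" is not reported again after "George Bauer") can be sketched roughly as below. This is only an illustration under stated assumptions, not the actual DictionaryNameFinder code: the class name LongestMatchSketch, the find() signature, the maxLen parameter, and the use of a Set<List<String>> as the dictionary are all hypothetical and chosen to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; does not use or mirror the OpenNLP API.
class LongestMatchSketch {

    /**
     * Returns token spans as {start, end} pairs (end exclusive).
     * At each position the longest dictionary entry wins, and the
     * scan then resumes after the matched tokens.
     */
    public static List<int[]> find(String[] tokens, Set<List<String>> dictionary, int maxLen) {
        List<String> all = Arrays.asList(tokens);
        List<int[]> spans = new ArrayList<>();
        int i = 0;
        while (i < tokens.length) {
            int matchEnd = -1;
            // Try the longest candidate first, so a multi-token entry is
            // found even when no shorter prefix is in the dictionary.
            for (int end = Math.min(tokens.length, i + maxLen); end > i; end--) {
                if (dictionary.contains(all.subList(i, end))) {
                    matchEnd = end;
                    break;
                }
            }
            if (matchEnd != -1) {
                spans.add(new int[] {i, matchEnd});
                i = matchEnd; // jump over the match so its tail is not matched again
            } else {
                i++;
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        Set<List<String>> dictionary = new HashSet<>();
        dictionary.add(Arrays.asList("Luise"));
        dictionary.add(Arrays.asList("George", "Bauer"));
        dictionary.add(Arrays.asList("Bauer"));

        String[] tokens = {"visit", "us", ",", "Luise", "and", "George", "Bauer", "."};
        for (int[] span : find(tokens, dictionary, 2)) {
            System.out.println(Arrays.toString(span));
        }
    }
}
```

The cost of trying longer candidates first is bounded here by maxLen (assumed to be the length of the longest dictionary entry), which is one way of keeping possibility 1) in the issue from turning into a long wait.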
