Well then, +1 on catching this if possible.  I'm sure it isn't the first
time it has happened.  But I guess this really should be handled properly
at tokenization time.
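
To make the "<END>." case concrete, here is a rough sketch of the kind of check I have in mind.  It is not the OpenNLP corpus parser; the names AnnotationTokenCheck and suspiciousTokens are made up for illustration, and it assumes the annotated line is split on whitespace, so a tag with punctuation glued to it shows up as one token that is neither a valid start tag nor a valid end tag.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of OpenNLP.
class AnnotationTokenCheck {

    // Returns whitespace-separated tokens that mention a tag but are not
    // exactly a tag, e.g. "<END>." where the period is glued to the tag.
    static List<String> suspiciousTokens(String line) {
        List<String> bad = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            boolean isStart = token.matches("<START(:[^>]+)?>");
            boolean isEnd = token.equals("<END>");
            boolean mentionsTag = token.contains("<START") || token.contains("<END>");
            if (mentionsTag && !isStart && !isEnd) {
                bad.add(token);
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        String line = "to visit us, <START> Luise <END> and <START> George Bauer <END>.";
        // Prints the tokens a reader of the corpus should look at by hand.
        System.out.println(suspiciousTokens(line));
    }
}
```

Reporting such tokens while reading the corpus would at least point at the bad line instead of producing a confusing evaluation result later.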

I was trying to figure out the difference between using the actual
offsets and using the length.  Both are equivalent... or should be in my
eyes.
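
This is the equivalence I would expect, as a small sketch.  TokenSpan is a hypothetical stand-in, not opennlp.tools.util.Span, and the sketch assumes the end offset is exclusive; with an inclusive end the two computations would differ by one.

```java
// Hypothetical sketch, not OpenNLP code.
class SpanArithmeticSketch {

    // Token span with an inclusive start and an exclusive end (assumption).
    static final class TokenSpan {
        final int start;
        final int end;

        TokenSpan(int start, int end) {
            this.start = start;
            this.end = end;
        }

        int length() {
            return end - start;
        }
    }

    // Next token index to examine, taken directly from the end offset.
    static int nextByOffset(TokenSpan span) {
        return span.end;
    }

    // The same index, computed from the start plus the length.
    static int nextByLength(TokenSpan span) {
        return span.start + span.length();
    }

    public static void main(String[] args) {
        TokenSpan span = new TokenSpan(5, 7); // arbitrary example indices
        System.out.println(nextByOffset(span) == nextByLength(span));
    }
}
```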

On 3/19/2012 10:51 PM, [email protected] wrote:
> Looks like we were checking it at the same time.
>
> Yes, because of this log I got confused. Actually this error is caused by
> an issue in the corpus. The sentence is not correctly annotated:
>
> ... to visit us, <START> Luise <END> and <START> George Bauer <END>.
> The <END> was not caught because of the '.' and the corpus parser got
> confused.
>
>
> On Mon, Mar 19, 2012 at 11:41 PM, James Kosin <[email protected]> wrote:
>
>> William,
>>
>> I found a problem with the longest match... I wasn't jumping over the
>> match after adding it to the spans.  This could cause George Bauer and
>> Bauer to be returned as two entries if Bauer was also in the dictionary.
>>
>> I'm a little confused by the output I'm getting now:
>> -----
>> Running opennlp.tools.namefind.DictionaryNameFinderEvaluatorTest
>> Expected: {
>> Since then, our guests have to ring at Veilchenstraße 11 if they want to
>> visit us, <START:default> Luise <END> and George Bauer <END>.}
>> Predicted: {
>> Since then, our guests have to ring at Veilchenstraße 11 if they want to
>> visit us, <START:default> Luise <END> and <START:default> George <END>
>> Bauer <END>.}
>> False positives: {
>> [George]
>> } False negatives: {
>> []
>> }
>> ----
>> after adding a line assertTrue(fmeasure.getFmeasure() == 1); to the test
>> file...?
>>
>>
>>
>>
>> On 3/19/2012 10:11 PM, William Colen (Commented) (JIRA) wrote:
>>>     [
>> https://issues.apache.org/jira/browse/OPENNLP-471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13233124#comment-13233124]
>>> William Colen commented on OPENNLP-471:
>>> ---------------------------------------
>>>
>>> Oops... sorry! I am debugging the code and might have found the real
>> reason for the previous output. I will investigate it further.
>>>> DictionaryNameFinder has HASHing issues
>>>> ---------------------------------------
>>>>
>>>>                 Key: OPENNLP-471
>>>>                 URL: https://issues.apache.org/jira/browse/OPENNLP-471
>>>>             Project: OpenNLP
>>>>          Issue Type: Bug
>>>>          Components: Name Finder
>>>>            Reporter: James Kosin
>>>>            Assignee: James Kosin
>>>>              Labels: dictionary, namefinder
>>>>             Fix For: tools-1.5.3
>>>>
>>>>
>>>> The DictionaryNameFinder has issues finding multi-token names when the
>> dictionary is searched a token at a time by the find() method.  This
>> happens when the dictionary doesn't have a single-token (or shorter)
>> match available.
>>>> Having a dictionary with {"folic", "acid"} without an entry for
>> {"folic"} will cause the find() method to totally skip the fact there is a
>> longer match possible.
>>>> Thanks to Jim for pushing this and to my debugging skills to find.
>>>> Three possibilities come to mind:
>>>> 1)  One I don't really like: we turn it into a larger problem by trying
>> longer matches when shorter ones don't match.  Unfortunately, this quickly
>> turns into a race to see who can wait longer.
>>>> 2)  A way of returning a possible match that may need exploring, or a
>> look-ahead type system to say we don't match "folic" but if you have "acid"
>> after "folic" we have a match for that in the dictionary.
>>>> 3)  Leave it as is and add the shorter terms to the dictionary...
>> maybe marking them as not-a-valid entry so we know we need a longer
>> match.
>>>
>>
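
For reference, the longest-match behaviour discussed in the quoted messages (try the longest candidate first, then jump over the match once it has been added to the spans) can be sketched roughly as follows.  This is not the actual DictionaryNameFinder code: LongestMatchSketch, the Set<List<String>> dictionary, and the int[] {start, end} spans are assumptions made for illustration, matching is case-sensitive, and the end index is exclusive.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch, not the OpenNLP implementation.
class LongestMatchSketch {

    // Returns {start, end} token index pairs, end exclusive.
    static List<int[]> find(String[] tokens, Set<List<String>> dictionary) {
        // The longest entry bounds how far ahead we ever need to look.
        int maxLen = 0;
        for (List<String> entry : dictionary) {
            maxLen = Math.max(maxLen, entry.size());
        }

        List<String> tokenList = Arrays.asList(tokens);
        List<int[]> spans = new ArrayList<>();
        int i = 0;
        while (i < tokens.length) {
            int matchedEnd = -1;
            // Longest candidate first, so {"folic", "acid"} is found even
            // when {"folic"} alone is not in the dictionary.
            for (int len = Math.min(maxLen, tokens.length - i); len >= 1; len--) {
                if (dictionary.contains(tokenList.subList(i, i + len))) {
                    matchedEnd = i + len;
                    break;
                }
            }
            if (matchedEnd != -1) {
                spans.add(new int[] {i, matchedEnd});
                i = matchedEnd; // jump over the match so "Bauer" is not reported again
            } else {
                i++;
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        Set<List<String>> dictionary = new HashSet<>();
        dictionary.add(Arrays.asList("Luise"));
        dictionary.add(Arrays.asList("George", "Bauer"));
        dictionary.add(Arrays.asList("Bauer"));

        String[] tokens = {"visit", "us", ",", "Luise", "and", "George", "Bauer", "."};
        for (int[] span : find(tokens, dictionary)) {
            System.out.println(span[0] + ".." + span[1]);
        }
    }
}
```

The cost per start position is bounded by the longest dictionary entry, which keeps the "trying longer matches" option from growing without limit, at the price of one set lookup per candidate length.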
