[jira] [Commented] (OPENNLP-562) invoking .find() on a RegexNameFinder instance brings back Spans with identical start/end indices

James Kosin (JIRA) Thu, 21 Feb 2013 20:06:20 -0800

    [ https://issues.apache.org/jira/browse/OPENNLP-562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13583894#comment-13583894 ]


James Kosin commented on OPENNLP-562:
-------------------------------------

Jim,

Thanks. Unfortunately, my brain slows down after 12:00am EST. I was still 
working on this at 1:00am EST when my wireless router cut me off; without 
that, I'd get little or no sleep.

I'll fix this later, since it really isn't a big deal.

James
                
> invoking .find() on a RegexNameFinder instance brings back Spans with 
> identical start/end indices
> -------------------------------------------------------------------------------------------------
>
>                 Key: OPENNLP-562
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-562
>             Project: OpenNLP
>          Issue Type: Bug
>          Components: Name Finder
>    Affects Versions: tools-1.5.2-incubating
>         Environment: Ubuntu 12.10 64-bit Java 7 u11
>            Reporter: Jim Piliouras
>            Assignee: James Kosin
>              Labels: bug, regex, span
>             Fix For: tools-1.5.3
>
>         Attachments: OPENNLP-562.patch
>
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> The RegexNameFinder class has a serious bug: whenever it finds something, it 
> produces a Span with the same start/end index. This happens because 
> 'sentencePosTokenMap' stores the same token index for both the start and the 
> end position of each token. Conceptually this is fine; after all, it is the 
> same token. Later on, however, matcher.start()/end() are used to decide what 
> to look up in the map. If we have stored the same index for both positions, 
> both lookups return the same number and the Span is ruined. The trick is to 
> store i+1 as the endIndex for that token in the map. That is essentially the 
> index of the next token, but since we expect tokenized text anyway, 
> everything is fine. Untokenized text breaks the system regardless, so in my 
> opinion it is safe to apply the forthcoming patch. A dirty approach would be 
> to leave the map as is and simply replace 'matcher.end()' with 
> 'matcher.end()+1' when doing the lookup.
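
The behaviour described above can be reproduced with a small standalone
sketch. This is not the OpenNLP source or the attached patch; it only mirrors
the description. The class name RegexSpanSketch, the find() helper, the
exclusiveEnd flag, and the use of int[] pairs in place of Span are all
assumptions made for illustration. Tokens are assumed to be joined with single
spaces before matching.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the logic described in the issue; not OpenNLP code.
class RegexSpanSketch {

    // Returns token-level {start, end} pairs for every regex match.
    // exclusiveEnd == false stores i for a token's end position (reported behaviour);
    // exclusiveEnd == true stores i + 1 (the proposed fix).
    static List<int[]> find(String[] tokens, Pattern pattern, boolean exclusiveEnd) {
        Map<Integer, Integer> sentencePosTokenMap = new HashMap<>();
        StringBuilder sentence = new StringBuilder();

        for (int i = 0; i < tokens.length; i++) {
            // Character offset where token i starts maps to token index i.
            sentencePosTokenMap.put(sentence.length(), i);
            sentence.append(tokens[i]);
            // Character offset just past token i maps to i or i + 1.
            sentencePosTokenMap.put(sentence.length(), exclusiveEnd ? i + 1 : i);
            if (i < tokens.length - 1) {
                sentence.append(' ');
            }
        }

        List<int[]> spans = new ArrayList<>();
        Matcher matcher = pattern.matcher(sentence);
        while (matcher.find()) {
            Integer start = sentencePosTokenMap.get(matcher.start());
            Integer end = sentencePosTokenMap.get(matcher.end());
            // Skip matches that do not align with token boundaries.
            if (start != null && end != null) {
                spans.add(new int[] {start, end});
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        String[] tokens = {"Call", "555-1234", "now"};
        Pattern phone = Pattern.compile("\\d{3}-\\d{4}");

        // Reported behaviour: start and end are identical.
        System.out.println(Arrays.toString(find(tokens, phone, false).get(0))); // [1, 1]
        // With i + 1 stored as the end index: a half-open span covering token 1.
        System.out.println(Arrays.toString(find(tokens, phone, true).get(0)));  // [1, 2]
    }
}
```

In this sketch, the 'matcher.end()+1' alternative would find no map entry for
a match that ends on the last token, because no position is stored past the
end of the joined string. Storing i+1 in the map does not have that gap.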

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
