[
https://issues.apache.org/jira/browse/OPENNLP-562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13583894#comment-13583894
]
James Kosin commented on OPENNLP-562:
-------------------------------------
Jim,
Thanks... Unfortunately, my brain cells start misfiring after 12:00am EST. I
was still working on this at 1:00am EST when my wireless router cut me
off... otherwise, I'd get little or no sleep.
I'll fix this later, since it really isn't a big deal.
James
> invoking .find() on a RegexNameFinder instance brings back Spans with
> identical start/end indices
> -------------------------------------------------------------------------------------------------
>
> Key: OPENNLP-562
> URL: https://issues.apache.org/jira/browse/OPENNLP-562
> Project: OpenNLP
> Issue Type: Bug
> Components: Name Finder
> Affects Versions: tools-1.5.2-incubating
> Environment: Ubuntu 12.10 64-bit Java 7 u11
> Reporter: Jim Piliouras
> Assignee: James Kosin
> Labels: bug, regex, span
> Fix For: tools-1.5.3
>
> Attachments: OPENNLP-562.patch
>
> Original Estimate: 2h
> Remaining Estimate: 2h
>
> The RegexNameFinder class has a serious bug: whenever it finds something, it
> produces a Span with the same start/end index. This happens because
> 'sentencePosTokenMap' stores the same token position for both the start and
> the end of a token. Conceptually this is fine; after all, it is the same
> token. Later on, however, matcher.start()/end() are used to determine what
> to ask from the map. If we've stored the same position for both, we get the
> same number back and the Span is ruined. The trick here is to store i+1 as
> the endIndex for that token in the map. That is essentially the position of
> the next token, but since we're expecting tokenized text anyway, everything
> is fine. Untokenized text breaks the system anyway, so in my opinion it is
> safe to apply the forthcoming patch. A dirty approach would be to leave the
> map as is and simply replace 'matcher.end()' with 'matcher.end()+1' when
> we're doing the lookup.
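
The behaviour described above can be reproduced with a small standalone
sketch. This is a simplified re-creation and not the actual RegexNameFinder
source: the class name SpanIndexSketch, the endOffset parameter, and the use
of int pairs in place of Span objects are assumptions made for illustration.
It also assumes that a span's end index is meant to be exclusive.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the lookup logic described in the report.
public class SpanIndexSketch {

    // Returns {start, end} token-index pairs for every regex match.
    // endOffset = 0 mimics the reported behaviour (same index stored for both ends);
    // endOffset = 1 mimics the proposed fix (store i+1 for the end of token i).
    public static List<int[]> find(String[] tokens, Pattern pattern, int endOffset) {
        // Maps a character offset in the joined sentence to a token index.
        Map<Integer, Integer> sentencePosTokenMap = new HashMap<>();
        StringBuilder sentence = new StringBuilder();

        for (int i = 0; i < tokens.length; i++) {
            sentencePosTokenMap.put(sentence.length(), i);             // token start
            sentence.append(tokens[i]);
            sentencePosTokenMap.put(sentence.length(), i + endOffset); // token end
            if (i < tokens.length - 1) {
                sentence.append(' ');
            }
        }

        List<int[]> spans = new ArrayList<>();
        Matcher matcher = pattern.matcher(sentence);
        while (matcher.find()) {
            Integer start = sentencePosTokenMap.get(matcher.start());
            Integer end = sentencePosTokenMap.get(matcher.end());
            // Matches that do not line up with token boundaries are skipped.
            if (start != null && end != null) {
                spans.add(new int[] {start, end});
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        String[] tokens = {"Released", "on", "2013-02-22", "."};
        Pattern date = Pattern.compile("\\d{4}-\\d{2}-\\d{2}");

        int[] before = find(tokens, date, 0).get(0);
        int[] after = find(tokens, date, 1).get(0);

        System.out.println("same index for both ends: [" + before[0] + ", " + before[1] + ")");
        System.out.println("i+1 stored for the end:   [" + after[0] + ", " + after[1] + ")");
    }
}
```

With the same index stored for both ends, the single-token match comes back
as [2, 2), an empty span. Storing i+1 for the end gives [2, 3), which covers
the date token.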