[jira] [Created] (OPENNLP-562) invoking .find() on a RegexNameFinder instance brings back Spans with identical start/end indices

Jim Piliouras (JIRA) Wed, 20 Feb 2013 11:19:15 -0800

Jim Piliouras created OPENNLP-562:
-------------------------------------

             Summary: invoking .find() on a RegexNameFinder instance brings back Spans with identical start/end indices
                 Key: OPENNLP-562
                 URL: https://issues.apache.org/jira/browse/OPENNLP-562
             Project: OpenNLP
          Issue Type: Bug
          Components: Name Finder
    Affects Versions: tools-1.5.2-incubating
         Environment: Ubuntu 12.10 64-bit Java 7 u11
            Reporter: Jim Piliouras
             Fix For: tools-1.5.3



The RegexNameFinder class has a serious bug: whenever it finds a match, it 
produces a Span with the same start and end index. This happens because 
'sentencePosTokenMap' stores the same token position for both the start and 
the end of each token. Conceptually this is fine, since it is the same token, 
but later on matcher.start()/end() is used to decide what to look up in the 
map. If we have stored the same position for both, we get the same number back 
twice and the Span is ruined. The fix is to store i+1 as the endIndex for that 
token in the map. That is essentially the position of the next token, but 
since we are expecting tokenized text anyway, everything is fine. Untokenized 
text breaks the system regardless, so in my opinion it is safe to apply the 
forthcoming patch. A dirty approach would be to leave the map as is and simply 
replace 'matcher.end()' with 'matcher.end()+1' when doing the lookup.
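
To make the off-by-one concrete, here is a minimal, self-contained sketch of 
the lookup described above. It is not the OpenNLP source: the class 
TokenSpanSketch, the method find(), the map name charPosToToken and the 
exclusiveEnd flag are hypothetical names introduced for illustration, and the 
assumption that tokens are joined with single spaces is mine. With 
exclusiveEnd=false both map entries for a token hold i, which reproduces the 
identical start/end indices; with exclusiveEnd=true the end entry holds i+1.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only; names and structure are assumptions, not OpenNLP code.
public class TokenSpanSketch {

    // Returns {startToken, endToken} pairs for every regex match over the
    // space-joined tokens. exclusiveEnd=false mimics the reported behaviour,
    // exclusiveEnd=true stores i + 1 for the end offset of token i.
    static int[][] find(String[] tokens, String regex, boolean exclusiveEnd) {
        Map<Integer, Integer> charPosToToken = new HashMap<>();
        StringBuilder sentence = new StringBuilder();

        for (int i = 0; i < tokens.length; i++) {
            // Character offset at which token i starts.
            charPosToToken.put(sentence.length(), i);
            sentence.append(tokens[i]);
            // Character offset at which token i ends.
            charPosToToken.put(sentence.length(), exclusiveEnd ? i + 1 : i);
            if (i < tokens.length - 1) {
                sentence.append(' ');
            }
        }

        List<int[]> spans = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(sentence);
        while (matcher.find()) {
            Integer start = charPosToToken.get(matcher.start());
            Integer end = charPosToToken.get(matcher.end());
            // Matches that do not line up with token boundaries are skipped.
            if (start != null && end != null) {
                spans.add(new int[] {start, end});
            }
        }
        return spans.toArray(new int[0][]);
    }

    public static void main(String[] args) {
        String[] tokens = {"call", "me", "at", "5551234", "today"};
        // Same index for start and end: prints [[3, 3]]
        System.out.println(Arrays.deepToString(find(tokens, "\\d+", false)));
        // End stored as i + 1: prints [[3, 4]]
        System.out.println(Arrays.deepToString(find(tokens, "\\d+", true)));
    }
}
```

In this sketch, looking up 'matcher.end()+1' in the unmodified map would land 
on the start offset of the following token, but a match on the last token has 
no such entry and would be dropped, which is one reason to prefer storing i+1 
in the map.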

