[jira] [Updated] (LUCENE-7795) WordDelimiterFilter produces invalid offsets in single word case

Michael Braun (JIRA) Thu, 20 Apr 2017 14:28:43 -0700

     [ https://issues.apache.org/jira/browse/LUCENE-7795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Michael Braun updated LUCENE-7795:
----------------------------------
    Description: 
This problem is not present in WordDelimiterGraphFilter, but it is present in 
WordDelimiterFilter's interaction with HTMLStripCharFilter.

Test code:

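{code}
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Package locations as of Lucene 6.5; older releases kept CharArraySet elsewhere.
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.charfilter.HTMLStripCharFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

public class TestTokenizationIssue2 {
    public static void main(String... args) throws IOException {
        // Decode the HTML entity first, then tokenize on whitespace.
        HTMLStripCharFilter charFilter = new HTMLStripCharFilter(getText());
        WhitespaceTokenizer whitespaceTokenizer = new WhitespaceTokenizer();
        whitespaceTokenizer.setReader(charFilter);

        // Swap in the graph filter to compare offsets:
        // WordDelimiterGraphFilter wdgf = new WordDelimiterGraphFilter(whitespaceTokenizer,
        //         WordDelimiterGraphFilter.GENERATE_WORD_PARTS, CharArraySet.EMPTY_SET);

        WordDelimiterFilter wdgf = new WordDelimiterFilter(whitespaceTokenizer,
                WordDelimiterFilter.GENERATE_WORD_PARTS, CharArraySet.EMPTY_SET);
        wdgf.reset();

        // Attribute instances are shared across tokens, so fetch them once.
        CharTermAttribute charTermAttribute = wdgf.getAttribute(CharTermAttribute.class);
        OffsetAttribute offsetAttribute = wdgf.getAttribute(OffsetAttribute.class);

        // Print each term with its start and end offsets into the original text.
        while (wdgf.incrementToken()) {
            System.out.println(charTermAttribute.toString() + " - "
                    + offsetAttribute.startOffset() + ',' + offsetAttribute.endOffset());
        }
        wdgf.end();
        wdgf.close();
    }

    private static Reader getText() {
        return new StringReader("&#x93;Risk");
    }
}
{code}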

The offsets produced by WordDelimiterFilter are 1,10, while 
WordDelimiterGraphFilter produces 0,10. The correct result is 0,10, because 
the original text is    {noformat}&#x93;Risk{noformat}   and offset 1 falls 
between the ampersand and the hash.
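
To make the difference concrete, here is a minimal sketch in plain Java (no Lucene involved) that slices the original text with each offset pair. The class and method names are hypothetical and exist only for illustration.

```java
final class OffsetSliceSketch {
    // The raw text from the test case; token offsets index into this string.
    static final String ORIGINAL = "&#x93;Risk";

    // Returns the part of the original text that a (start, end) offset pair selects.
    static String slice(int start, int end) {
        return ORIGINAL.substring(start, end);
    }

    public static void main(String[] args) {
        // Offsets reported by WordDelimiterGraphFilter: the whole token, entity included.
        System.out.println(slice(0, 10));
        // Offsets reported by WordDelimiterFilter: the slice starts inside the entity.
        System.out.println(slice(1, 10));
    }
}
```

Anything that relies on these offsets, such as highlighting, would select text starting in the middle of the character entity when given 1,10.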

Inside WordDelimiterFilter, I believe the conditional branch taken for "if 
(isSingleWord && startOffset <= savedEndOffset) "   is invalid. The filter 
should always use the saved start and end offsets, because it cannot assume 
that the iterator's current and end positions are reliable markers.
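
The following is a rough model of that branch, not the actual Lucene source. It assumes the start offset is computed as the saved start offset plus the iterator's current position, and that the position is 1 for this input. That reading is consistent with the reported 1,10 but is not shown in this report. The class and method names are hypothetical.

```java
final class OffsetBranchSketch {
    // Models the quoted condition; returns {start, end}.
    // Assumption: startOffset = savedStartOffset + iterator position within the filtered term.
    static int[] currentBehavior(boolean isSingleWord, int savedStartOffset,
                                 int savedEndOffset, int iteratorCurrent) {
        int startOffset = savedStartOffset + iteratorCurrent;
        if (isSingleWord && startOffset <= savedEndOffset) {
            // Mixes a filtered-text position into original-text offsets.
            return new int[] {startOffset, savedEndOffset};
        }
        return new int[] {savedStartOffset, savedEndOffset};
    }

    // Models the suggested change: always fall back to the saved offsets.
    static int[] proposedBehavior(int savedStartOffset, int savedEndOffset) {
        return new int[] {savedStartOffset, savedEndOffset};
    }

    public static void main(String[] args) {
        // Tokenizer offsets for "&#x93;Risk" are 0,10; assumed iterator position is 1.
        int[] current = currentBehavior(true, 0, 10, 1);
        int[] proposed = proposedBehavior(0, 10);
        System.out.println(current[0] + "," + current[1]);
        System.out.println(proposed[0] + "," + proposed[1]);
    }
}
```

The point of the sketch is that the iterator position counts characters in the filtered term, while the saved offsets refer to the original text, so adding the two is only safe when the char filter has not changed the text length.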



> WordDelimiterFilter produces invalid offsets in single word case
> ----------------------------------------------------------------
>
>                 Key: LUCENE-7795
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7795
>             Project: Lucene - Core
>          Issue Type: Bug
>    Affects Versions: master (7.0), 6.5
>            Reporter: Michael Braun



