Michael Braun created LUCENE-7766:
-------------------------------------

             Summary: WordDelimiter(Graph)Filter does not handle split offsets after HTMLStripCharFilter correctly
                 Key: LUCENE-7766
                 URL: https://issues.apache.org/jira/browse/LUCENE-7766
             Project: Lucene - Core
          Issue Type: Bug
    Affects Versions: 6.5, 6.2.1
            Reporter: Michael Braun


When HTMLStripCharFilter is used before WordDelimiterGraphFilter (or 
WordDelimiterFilter; I tested with both), stripping the HTML from the text 
prevents the filter from producing correct offsets for split tokens. 

Configured with generate word parts, split on case change, and preserve 
original:

Example string: "MayBe" produces these offsets (Word - start,end)
{code}
MayBe - 0,5
May - 0,3
Be - 3,5
{code}
Example string: "May<b>Be</b>" produces these offsets (Word - start,end)
{code}
MayBe - 0,12
May - 0,12
Be - 0,12
{code}

Notice that 'May' and 'Be' are created, but their offsets are the same as 
those of the original 'MayBe'.

I traced this to how 'hasIllegalOffsets' is calculated in 
WordDelimiterGraphFilter (and in WordDelimiterFilter before it). The relevant 
source code is:

{code}
    // if length by start + end offsets doesn't match the term's text then set offsets for all our word parts/concats to the incoming
    // offsets.  this can happen if WDGF is applied to an injected synonym, or to a stem'd form, etc:
    hasIllegalOffsets = (savedEndOffset - savedStartOffset != savedTermLength);
{code}
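
To make the effect of that check concrete, here is a minimal standalone sketch 
that applies the same comparison to the offsets reported above. The class and 
method names are hypothetical and not part of Lucene; it assumes the token 
text is "MayBe" (5 chars) in both cases and that the reported end offset of 12 
corresponds to the length of the raw input "May<b>Be</b>".

```java
public class IllegalOffsetCheck {

    // Hypothetical helper: the same comparison as the quoted Lucene line,
    // with the saved* fields passed in as plain parameters.
    static boolean hasIllegalOffsets(int startOffset, int endOffset, int termLength) {
        return endOffset - startOffset != termLength;
    }

    public static void main(String[] args) {
        int termLength = "MayBe".length(); // token text after HTML stripping

        // Plain input "MayBe": offsets 0,5 span exactly the 5-char term.
        System.out.println(hasIllegalOffsets(0, "MayBe".length(), termLength));

        // Input "May<b>Be</b>": offsets 0,12 refer to the raw text, which
        // still contains the markup, so the span no longer matches the term.
        System.out.println(hasIllegalOffsets(0, "May<b>Be</b>".length(), termLength));
    }
}
```

With the stripped input the check evaluates to true, which is the case where 
every word part falls back to the incoming token's offsets (0,12).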


Here is sample code that reproduces the issue:

{code}
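import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.charfilter.HTMLStripCharFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.miscellaneous.WordDelimiterGraphFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

public class TestTokenizationIssue {
    public static void main(String... args) throws IOException {
        // Strip HTML first, then tokenize on whitespace.
        HTMLStripCharFilter charFilter = new HTMLStripCharFilter(getText());
        WhitespaceTokenizer whitespaceTokenizer = new WhitespaceTokenizer();
        whitespaceTokenizer.setReader(charFilter);

        // Generate word parts, split on case change, and preserve the original.
        int flags = WordDelimiterGraphFilter.GENERATE_WORD_PARTS
                | WordDelimiterGraphFilter.SPLIT_ON_CASE_CHANGE
                | WordDelimiterGraphFilter.PRESERVE_ORIGINAL;
        WordDelimiterGraphFilter wdgf =
                new WordDelimiterGraphFilter(whitespaceTokenizer, flags, CharArraySet.EMPTY_SET);

        // Attribute instances are reused per token, so look them up once.
        CharTermAttribute charTermAttribute = wdgf.getAttribute(CharTermAttribute.class);
        OffsetAttribute offsetAttribute = wdgf.getAttribute(OffsetAttribute.class);

        wdgf.reset();
        while (wdgf.incrementToken()) {
            System.out.println(charTermAttribute.toString() + " - "
                    + offsetAttribute.startOffset() + ',' + offsetAttribute.endOffset());
        }
        wdgf.end();
        wdgf.close();
    }

    private static Reader getText() {
        return new StringReader("MayBe");
        //return new StringReader("May<b>Be</b>");
    }
}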
{code}


