Michael Braun created LUCENE-7766:
-------------------------------------
Summary: WordDelimiter(Graph)Filter does not handle split offsets
after HTMLStripCharFilter correctly
Key: LUCENE-7766
URL: https://issues.apache.org/jira/browse/LUCENE-7766
Project: Lucene - Core
Issue Type: Bug
Affects Versions: 6.5, 6.2.1
Reporter: Michael Braun
When the HTMLStripCharFilter runs before the WordDelimiterGraphFilter (or
WordDelimiterFilter; I tested with both), stripping the HTML from the text
prevents the filter from producing correct offsets for split tokens.
The filter is configured with GENERATE_WORD_PARTS, SPLIT_ON_CASE_CHANGE, and
PRESERVE_ORIGINAL.
Example string "MayBe" produces these offsets (word - start,end):
{code}
MayBe - 0,5
May - 0,3
Be - 3,5
{code}
Example string "May<b>Be</b>" produces these offsets (word - start,end):
{code}
MayBe - 0,12
May - 0,12
Be - 0,12
{code}
Notice that 'May' and 'Be' are created, but both carry the offsets of the
original 'MayBe' (0,12) instead of offsets of their own.
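I traced this to the way 'hasIllegalOffsets' is calculated in the
WordDelimiterGraphFilter (and in the WordDelimiterFilter before it). Once the
markup is stripped, the offset span (12) no longer equals the term length (5),
so the flag is set and every part falls back to the incoming offsets. From the
source code: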
{code}
// if length by start + end offsets doesn't match the term's text then set
// offsets for all our word parts/concats to the incoming
// offsets. this can happen if WDGF is applied to an injected synonym, or
// to a stem'd form, etc:
hasIllegalOffsets = (savedEndOffset - savedStartOffset != savedTermLength);
{code}
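To make the arithmetic concrete, here is a minimal standalone sketch of that comparison applied to the two examples above. It does not use Lucene; the class name IllegalOffsetsSketch is hypothetical, and the parameter names simply mirror the fields in the quoted snippet.

```java
// Hypothetical helper class, not part of Lucene: it only replays the
// comparison quoted above with the offsets reported in this issue.
final class IllegalOffsetsSketch {

    // Same expression as the quoted source line, with the saved fields
    // passed in as parameters.
    static boolean hasIllegalOffsets(int savedStartOffset, int savedEndOffset, int savedTermLength) {
        return savedEndOffset - savedStartOffset != savedTermLength;
    }

    public static void main(String[] args) {
        int termLength = "MayBe".length(); // 5 characters in both cases

        // Plain input "MayBe": offsets 0,5 span exactly the term text.
        System.out.println("MayBe        -> " + hasIllegalOffsets(0, 5, termLength));

        // Input "May<b>Be</b>": the term is still "MayBe", but the reported
        // offsets 0,12 also cover the stripped tags.
        System.out.println("May<b>Be</b> -> " + hasIllegalOffsets(0, 12, termLength));
    }
}
```

The first call returns false and the second returns true, which matches the observed behaviour: only the stripped input is treated as having illegal offsets, so only there do all parts reuse 0,12.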
The following sample code reproduces the issue; switch the return statement in
getText() to compare the two inputs:
{code}
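import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.charfilter.HTMLStripCharFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.miscellaneous.WordDelimiterGraphFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

public class TestTokenizationIssue {

  public static void main(String... args) throws IOException {
    // Strip HTML first, then tokenize the stripped text on whitespace.
    HTMLStripCharFilter charFilter = new HTMLStripCharFilter(getText());
    WhitespaceTokenizer whitespaceTokenizer = new WhitespaceTokenizer();
    whitespaceTokenizer.setReader(charFilter);

    WordDelimiterGraphFilter wdgf = new WordDelimiterGraphFilter(
        whitespaceTokenizer,
        WordDelimiterGraphFilter.GENERATE_WORD_PARTS
            | WordDelimiterGraphFilter.SPLIT_ON_CASE_CHANGE
            | WordDelimiterGraphFilter.PRESERVE_ORIGINAL,
        CharArraySet.EMPTY_SET);

    // Attribute instances are shared across tokens, so look them up once.
    CharTermAttribute charTermAttribute = wdgf.getAttribute(CharTermAttribute.class);
    OffsetAttribute offsetAttribute = wdgf.getAttribute(OffsetAttribute.class);

    wdgf.reset();
    while (wdgf.incrementToken()) {
      // Prints "term - start,end" for the original token and each split part.
      System.out.println(charTermAttribute.toString() + " - "
          + offsetAttribute.startOffset() + ',' + offsetAttribute.endOffset());
    }
    wdgf.end();
    wdgf.close();
  }

  private static Reader getText() {
    return new StringReader("MayBe");
    // return new StringReader("May<b>Be</b>"); // use this input to see the bad offsets
  }
}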
{code}