[
https://issues.apache.org/jira/browse/LUCENE-7766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Michael Braun updated LUCENE-7766:
----------------------------------
Description:
When using the HTMLStripCharFilter before the WordDelimiterGraphFilter (or
WordDelimiterFilter - I tested with both), stripping the HTML from the text
makes it impossible to produce correct offsets for the split tokens.
The filter is configured with generate word parts, split on case change, and
preserve original:
Example string: "MayBe" produces these offsets (Word - start,end)
{code}
MayBe - 0,5
May - 0,3
Be - 3,5
{code}
Example string "May<b>Be</b>" produces these offsets (Word- start,end)
{code}
MayBe - 0,12
May - 0,12
Be - 0,12
{code}
Notice that 'May' and 'Be' are produced, but their offsets are the same as
those of the original 'MayBe'.
I traced this down to the logic in WordDelimiterGraphFilter (and
WordDelimiterFilter before it) that computes 'hasIllegalOffsets', as seen in
the source code:
{code}
// if length by start + end offsets doesn't match the term's text then set offsets for all our word parts/concats to the incoming
// offsets. this can happen if WDGF is applied to an injected synonym, or to a stem'd form, etc:
hasIllegalOffsets = (savedEndOffset - savedStartOffset != savedTermLength);
{code}
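To make the arithmetic concrete, here is a minimal sketch of the values this
check sees for the "May<b>Be</b>" input (the values are inferred from the
offsets reported above, not captured in a debugger):
{code}
// HTMLStripCharFilter removes the tags but corrects token offsets back into
// the original markup, so the 5-character token "MayBe" spans 12 original chars.
int savedStartOffset = 0;   // start of "May" in "May<b>Be</b>"
int savedEndOffset   = 12;  // end of "</b>" in the original input
int savedTermLength  = 5;   // "MayBe".length() after stripping

// 12 - 0 != 5, so the filter concludes the offsets cannot be trusted and
// assigns the original token's offsets (0,12) to every generated part:
boolean hasIllegalOffsets = (savedEndOffset - savedStartOffset != savedTermLength); // true
{code}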
Here is a self-contained sample that reproduces the issue:
{code}
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.charfilter.HTMLStripCharFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.miscellaneous.WordDelimiterGraphFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

public class TestTokenizationIssue {

    public static void main(String... args) throws IOException {
        // Strip HTML before tokenizing, then split tokens on case changes.
        HTMLStripCharFilter charFilter = new HTMLStripCharFilter(getText());
        WhitespaceTokenizer whitespaceTokenizer = new WhitespaceTokenizer();
        whitespaceTokenizer.setReader(charFilter);
        WordDelimiterGraphFilter wdgf = new WordDelimiterGraphFilter(whitespaceTokenizer,
                WordDelimiterGraphFilter.GENERATE_WORD_PARTS
                        | WordDelimiterGraphFilter.SPLIT_ON_CASE_CHANGE
                        | WordDelimiterGraphFilter.PRESERVE_ORIGINAL,
                CharArraySet.EMPTY_SET);

        CharTermAttribute charTermAttribute = wdgf.getAttribute(CharTermAttribute.class);
        OffsetAttribute offsetAttribute = wdgf.getAttribute(OffsetAttribute.class);

        wdgf.reset();
        while (wdgf.incrementToken()) {
            System.out.println(charTermAttribute.toString() + " - "
                    + offsetAttribute.startOffset() + ',' + offsetAttribute.endOffset());
        }
        wdgf.end();
        wdgf.close();
    }

    private static Reader getText() {
        //return new StringReader("MayBe");
        return new StringReader("May<b>Be</b>");
    }
}
{code}
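Running this as-is prints the collapsed offsets from the second table above;
swapping the commented line in getText() to the plain "MayBe" input prints the
correct per-part offsets from the first table.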
> WordDelimiter(Graph)Filter does not handle split offsets after
> HTMLStripCharFilter correctly
> --------------------------------------------------------------------------------------------
>
> Key: LUCENE-7766
> URL: https://issues.apache.org/jira/browse/LUCENE-7766
> Project: Lucene - Core
> Issue Type: Bug
> Affects Versions: 6.2.1, 6.5
> Reporter: Michael Braun
>