[
https://issues.apache.org/jira/browse/LUCENE-7795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Michael Braun updated LUCENE-7795:
----------------------------------
Description:
This problem is not present in WordDelimiterGraphFilter, but it is present in
WordDelimiterFilter's interaction with HTMLStripCharFilter.
Test code:
{code}
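import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// CharArraySet is in org.apache.lucene.analysis on recent versions;
// older releases had it in org.apache.lucene.analysis.util.
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.charfilter.HTMLStripCharFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

public class TestTokenizationIssue2 {
  public static void main(String... args) throws IOException {
    HTMLStripCharFilter charFilter = new HTMLStripCharFilter(getText());
    WhitespaceTokenizer whitespaceTokenizer = new WhitespaceTokenizer();
    whitespaceTokenizer.setReader(charFilter);
    // Swap in the graph filter to compare offsets:
    // WordDelimiterGraphFilter wdgf = new WordDelimiterGraphFilter(whitespaceTokenizer,
    //     WordDelimiterGraphFilter.GENERATE_WORD_PARTS, CharArraySet.EMPTY_SET);
    WordDelimiterFilter wdgf = new WordDelimiterFilter(whitespaceTokenizer,
        WordDelimiterFilter.GENERATE_WORD_PARTS, CharArraySet.EMPTY_SET);
    wdgf.reset();
    while (wdgf.incrementToken()) {
      CharTermAttribute charTermAttribute = wdgf.getAttribute(CharTermAttribute.class);
      OffsetAttribute offsetAttribute = wdgf.getAttribute(OffsetAttribute.class);
      // Print each term with its start and end offsets into the original text.
      System.out.println(charTermAttribute.toString() + " - "
          + offsetAttribute.startOffset() + ',' + offsetAttribute.endOffset());
    }
  }

  private static Reader getText() {
    // In the actual input the opening quote is an HTML character reference (starting with &#).
    return new StringReader("“Risk");
  }
}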
{code}
WordDelimiterFilter produces the offsets 1,10, while WordDelimiterGraphFilter
produces 0,10. The correct result is 0,10. In the original text,
{noformat}“Risk{noformat}, the opening quote is written as an HTML character
reference starting with "&#", so offset 1 falls between the ampersand and the
hash, in the middle of the reference.
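To make the offset problem concrete, here is a minimal plain-Java sketch that slices an input string at both offset pairs. The input is an assumption: "&#147;" is one 6-character reference that would be consistent with the reported end offset of 10, and the class and method names are hypothetical.

```java
// Illustration only: shows which characters each offset pair covers.
class EntityOffsetSketch {
    // Assumed input; the exact character reference used in the report may differ.
    static final String ORIGINAL = "&#147;Risk";

    // Returns the span of the original text covered by [start, end).
    static String slice(int start, int end) {
        return ORIGINAL.substring(start, end);
    }

    public static void main(String[] args) {
        System.out.println("0,10 -> " + slice(0, 10)); // whole reference plus the word
        System.out.println("1,10 -> " + slice(1, 10)); // starts inside the reference
    }
}
```

A highlighter that uses 1,10 would cut the reference in half and leave a stray ampersand in front of the highlighted span.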
Inside WordDelimiterFilter, I believe the conditional branch "if
(isSingleWord && startOffset <= savedEndOffset)" is invalid. The filter should
always use the saved start and end offsets, because it cannot assume that the
iterator's current and end positions are reliable markers into the original
text.
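The difference between the quoted branch and the proposed behaviour can be sketched in plain Java. This is a simplified model and not the Lucene source: the method names are hypothetical, and it assumes the word part "Risk" starts at index 1 of the filtered term and that the start offset is computed as the saved start plus the iterator position.

```java
// Simplified model of the offset choice described above; not the Lucene source.
class OffsetChoiceSketch {

    // Branch as quoted in the report: for a single word, trust the iterator's
    // position as long as it does not run past the saved end offset.
    static int[] quotedBranch(boolean isSingleWord, int savedStart, int savedEnd,
                              int iteratorCurrent) {
        int startOffset = savedStart + iteratorCurrent; // assumed computation
        if (isSingleWord && startOffset <= savedEnd) {
            return new int[] {startOffset, savedEnd};
        }
        return new int[] {savedStart, savedEnd};
    }

    // Proposed behaviour: always use the offsets saved from the incoming token.
    static int[] alwaysSaved(int savedStart, int savedEnd) {
        return new int[] {savedStart, savedEnd};
    }

    public static void main(String[] args) {
        // Offsets of the incoming token, per the report: 0,10.
        int savedStart = 0;
        int savedEnd = 10;
        // Assumed: "Risk" begins at index 1 of the 5-character filtered term.
        int iteratorCurrent = 1;

        int[] quoted = quotedBranch(true, savedStart, savedEnd, iteratorCurrent);
        int[] saved = alwaysSaved(savedStart, savedEnd);
        System.out.println("quoted branch: " + quoted[0] + "," + quoted[1]);
        System.out.println("always saved:  " + saved[0] + "," + saved[1]);
    }
}
```

Under these assumptions the quoted branch reproduces the reported 1,10, because the iterator position is an index into the filtered term rather than into the original text, while always using the saved offsets gives 0,10.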
> WordDelimiterFilter produces invalid offsets in single word case
> ----------------------------------------------------------------
>
> Key: LUCENE-7795
> URL: https://issues.apache.org/jira/browse/LUCENE-7795
> Project: Lucene - Core
> Issue Type: Bug
> Affects Versions: master (7.0), 6.5
> Reporter: Michael Braun