Dawid Weiss created LUCENE-7267:
-----------------------------------

             Summary: Field with an explicit TokenStream must be tokenized and then uses the default Analyzer offset gaps
                 Key: LUCENE-7267
                 URL: https://issues.apache.org/jira/browse/LUCENE-7267
             Project: Lucene - Core
          Issue Type: Bug
            Reporter: Dawid Weiss
            Priority: Minor


This took me somewhat by surprise. We have some fairly complex code that uses 
multivalued fields with explicit token streams (which provide their own offset 
data).

It was surprising to see that offsets for subsequent values were shifted by 1 
compared to what was explicitly provided in the OffsetAttribute. A bit of 
debugging showed this code inside {{PerField.invert}}:

{code}
      if (analyzed) {
        invertState.position += docState.analyzer.getPositionIncrementGap(fieldInfo.name);
        invertState.offset += docState.analyzer.getOffsetGap(fieldInfo.name);
      }
{code}

A field with an explicit token stream must still be declared as tokenized, so 
PerField assumes that the field came from an analyzer (when in fact it 
didn't):

{code}
      final boolean analyzed = fieldType.tokenized() && docState.analyzer != null;
{code}

While the default position increment gap is 0, the default offset gap is 1, 
which causes the shift.
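
To make the effect concrete, here is a minimal standalone sketch of the gap 
bookkeeping. It is not Lucene code: the class and method names are 
hypothetical, and it models only the gap term from the snippet above, ignoring 
any other adjustments the indexer makes to the running offset between values.

```java
import java.util.Arrays;

// Simplified, hypothetical model of the gap bookkeeping; not Lucene's API.
class OffsetGapSketch {

    // Mirrors the quoted condition: a tokenized field plus a non-null analyzer.
    static boolean analyzed(boolean tokenized, boolean hasAnalyzer) {
        return tokenized && hasAnalyzer;
    }

    // For each value of a multivalued field, returns how much the gap term
    // alone has added to the base offset by the time that value is inverted.
    static int[] gapShiftPerValue(int valueCount, boolean analyzed, int offsetGap) {
        int[] shifts = new int[valueCount];
        int offset = 0;
        for (int i = 0; i < valueCount; i++) {
            shifts[i] = offset;          // base offset seen by value i
            if (analyzed) {
                offset += offsetGap;     // gap applied after each value
            }
        }
        return shifts;
    }

    public static void main(String[] args) {
        // A tokenized field with an analyzer present and an offset gap of 1.
        boolean analyzed = analyzed(true, true);
        System.out.println(Arrays.toString(gapShiftPerValue(3, analyzed, 1)));
        // prints [0, 1, 2]
    }
}
```

With the gap set to 0, or with the field not treated as analyzed, every entry 
stays at 0, which is the behaviour I expected for explicit token streams.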

Thoughts?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
