[ https://issues.apache.org/jira/browse/SOLR-1398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12749510#action_12749510 ]

Anders Melchiorsen commented on SOLR-1398:
------------------------------------------

I used this slightly modified configuration from the example config:

    <fieldType name="textCharNorm" class="solr.TextField" positionIncrementGap="100" >
      <analyzer>
        <charFilter class="solr.MappingCharFilterFactory" mapping="mapping.txt"/>
        <tokenizer class="solr.PatternTokenizerFactory" pattern="[,;/\s]+" />
      </analyzer>
    </fieldType>

with the file mapping.txt containing just:

    "& uuml;" => "ü"

and analyzing the string "G& uuml;nther G& uuml;nther is here" with 
analysis.jsp (with verbose output) gives offsets:

    5,12        13,20   21,23   24,28

while they should be:

    0,12        13,25   26,28   29,33

(Note: I had to split the HTML entity into two parts so that it displays in JIRA.)
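
To make the expected numbers easy to check, here is a minimal Python sketch of the offset arithmetic. It is only a simplified model, not Solr or Lucene code: the names apply_mapping, token_offsets and raw_offset_of are made up for this illustration, the entity is written unsplit, and the mapping handles just the single rule from mapping.txt. The idea is that the tokenizer sees the filtered text, and each token offset must then be mapped back to a position in the original input.

```python
import re

# The input as it is actually analyzed (entity written unsplit here).
RAW = "G&uuml;nther G&uuml;nther is here"
ENTITY, REPLACEMENT = "&uuml;", "ü"      # the single rule from mapping.txt
DELIMS = re.compile(r"[,;/\s]+")         # the tokenizer pattern from the config


def apply_mapping(raw):
    """Apply the mapping and remember where each output character came from.

    Returns the filtered text plus a list that gives, for every position in
    the filtered text (including the end position), the raw-input offset.
    """
    out = []
    raw_offset_of = []
    i = 0
    while i < len(raw):
        raw_offset_of.append(i)
        if raw.startswith(ENTITY, i):
            out.append(REPLACEMENT)
            i += len(ENTITY)
        else:
            out.append(raw[i])
            i += 1
    raw_offset_of.append(len(raw))       # end-of-input position
    return "".join(out), raw_offset_of


def token_offsets(text):
    """Return (start, end) for each run of characters between delimiters."""
    offsets = []
    pos = 0
    for m in DELIMS.finditer(text):
        if m.start() > pos:
            offsets.append((pos, m.start()))
        pos = m.end()
    if pos < len(text):
        offsets.append((pos, len(text)))
    return offsets


filtered, raw_offset_of = apply_mapping(RAW)
uncorrected = token_offsets(filtered)    # offsets into the filtered text
corrected = [(raw_offset_of[s], raw_offset_of[e]) for s, e in uncorrected]

print(uncorrected)  # [(0, 7), (8, 15), (16, 18), (19, 23)]
print(corrected)    # [(0, 12), (13, 25), (26, 28), (29, 33)]
```

The corrected list matches the offsets I expect above. The offsets that analysis.jsp reports match neither list, which is consistent with the corrections being applied wrongly somewhere in the tokenizer.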


> PatternTokenizerFactory ignores offset corrections
> --------------------------------------------------
>
>                 Key: SOLR-1398
>                 URL: https://issues.apache.org/jira/browse/SOLR-1398
>             Project: Solr
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 1.4
>            Reporter: Anders Melchiorsen
>
> I have an analyzer with a MappingCharFilterFactory followed by a 
> PatternTokenizerFactory. This causes wrong offsets, and thus wrong highlights.
> Replacing the tokenizer with WhitespaceTokenizerFactory gives correct 
> offsets, so I expect the problem to be with PatternTokenizerFactory.
