[
https://issues.apache.org/jira/browse/SOLR-1398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12749510#action_12749510
]
Anders Melchiorsen commented on SOLR-1398:
------------------------------------------
I used this slightly modified configuration from the example config:
<fieldType name="textCharNorm" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping.txt"/>
    <tokenizer class="solr.PatternTokenizerFactory" pattern="[,;/\s]+"/>
  </analyzer>
</fieldType>
with the file mapping.txt containing just:
"& uuml;" => "ü"
and analyzing the string "G& uuml;nther G& uuml;nther is here" with
analysis.jsp (with verbose output) gives offsets:
5,12 13,20 21,23 24,28
while they should be:
0,12 13,25 26,28 29,33
(Note: I had to split the HTML entity into two parts to make it display in JIRA.)
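For reference, the expected offsets can be derived with a small standalone sketch. It is not Lucene's or Solr's implementation and uses no Solr classes. It assumes a single, non-empty mapping rule, and the class and method names (OffsetCorrectionSketch, tokenOffsets) are made up for illustration. It applies the mapping while recording the original position of every filtered character, splits the filtered text on the same pattern, and reports each token's offsets in terms of the original input. The entity is written unsplit here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of Solr or Lucene.
class OffsetCorrectionSketch {

    // Returns "start,end" pairs expressed as offsets into the ORIGINAL input.
    // Assumes 'from' is non-empty and is the only mapping rule.
    static List<String> tokenOffsets(String input, String from, String to, String delimiterRegex) {
        StringBuilder filtered = new StringBuilder();
        // origPos.get(i) is the original offset of filtered character i.
        List<Integer> origPos = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            if (input.startsWith(from, i)) {
                // Every replacement character maps back to the start of the match.
                for (int k = 0; k < to.length(); k++) {
                    origPos.add(i);
                }
                filtered.append(to);
                i += from.length();
            } else {
                origPos.add(i);
                filtered.append(input.charAt(i));
                i++;
            }
        }
        // Sentinel so that an end offset at the end of the text can be mapped too.
        origPos.add(input.length());

        List<String> offsets = new ArrayList<>();
        Matcher m = Pattern.compile(delimiterRegex).matcher(filtered);
        int start = 0;
        while (m.find()) {
            if (m.start() > start) {
                // Map both ends of the token back to original coordinates.
                offsets.add(origPos.get(start) + "," + origPos.get(m.start()));
            }
            start = m.end();
        }
        if (start < filtered.length()) {
            offsets.add(origPos.get(start) + "," + origPos.get(filtered.length()));
        }
        return offsets;
    }

    public static void main(String[] args) {
        String input = "G&uuml;nther G&uuml;nther is here";
        // Prints: 0,12 13,25 26,28 29,33
        System.out.println(String.join(" ", tokenOffsets(input, "&uuml;", "\u00fc", "[,;/\\s]+")));
    }
}
```

Each entity is six characters long in the input but one character after mapping, so every token following a replacement shifts by five positions unless the offsets are mapped back.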
> PatternTokenizerFactory ignores offset corrections
> --------------------------------------------------
>
> Key: SOLR-1398
> URL: https://issues.apache.org/jira/browse/SOLR-1398
> Project: Solr
> Issue Type: Bug
> Components: Analysis
> Affects Versions: 1.4
> Reporter: Anders Melchiorsen
>
> I have an analyzer with a MappingCharFilterFactory followed by a
> PatternTokenizerFactory. This causes wrong offsets, and thus wrong highlights.
> Replacing the tokenizer with WhitespaceTokenizerFactory gives correct
> offsets, so I expect the problem to be with PatternTokenizerFactory.