[jira] [Commented] (SOLR-7926) Hit highlighting with EdgeNGramFilterFactory

JIRA Fri, 14 Aug 2015 07:06:08 -0700

    [ 
https://issues.apache.org/jira/browse/SOLR-7926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14697053#comment-14697053
 ]


Jan Høydahl commented on SOLR-7926:
-----------------------------------

Hi. 

Questions like this are better suited to the solr-user mailing list. Most 
likely this is not a bug. Please ask the question on the list, and also say 
which highlighter implementation you use, with what configuration, and why you 
expect it to behave the way you want (with a reference to the documentation). 
I'll close this JIRA as "Invalid".

If it turns out to be a suspected bug, or you find that the result you want is 
not easily configurable with any of the existing highlighter implementations, 
then please re-open.

> Hit highlighting with EdgeNGramFilterFactory
> --------------------------------------------
>
>                 Key: SOLR-7926
>                 URL: https://issues.apache.org/jira/browse/SOLR-7926
>             Project: Solr
>          Issue Type: Bug
>          Components: highlighter
>    Affects Versions: 5.1, 5.2.1
>         Environment: CentOS 7 (5.2.1), OS X 10.10.5 (5.1)
>            Reporter: Bjørn Hjelle
>            Priority: Critical
>              Labels: EdgeNGramTokenFilter, highlighting
>
> Hit highlighting highlights the whole word, not just the part that matches 
> the search term, when using EdgeNGramFilterFactory in the field type.
> In schema.xml I have field type text_ngram:
>                 <fieldType name="text_ngram" class="solr.TextField">
>                         <analyzer type="index">
>                                 <charFilter 
> class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
>                                 <tokenizer 
> class="solr.WhitespaceTokenizerFactory"/>
>                            <!--tokenizer 
> class="solr.StandardTokenizerFactory"/-->
>                                 <filter 
> class="solr.WordDelimiterFilterFactory" generateWordParts="1" 
> generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" 
> splitOnCaseChange="1"/>
>                                 <filter class="solr.LowerCaseFilterFactory"/>
>                                 <filter class="solr.EdgeNGramFilterFactory" 
> maxGramSize="20" minGramSize="3" luceneMatchVersion="4.3"/>
>                                 <filter 
> class="solr.PatternReplaceFilterFactory" pattern="([^\w\d\*æÆÅ])" 
> replacement="" replace="all"/>
>                         </analyzer>
>                         <analyzer type="query">
>                                 <charFilter 
> class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
>                                 <tokenizer 
> class="solr.StandardTokenizerFactory"/>
>                                 <filter 
> class="solr.WordDelimiterFilterFactory" generateWordParts="0" 
> generateNumberParts="0" catenateWords="0" catenateNumbers="0" catenateAll="0" 
> splitOnCaseChange="0"/>
>                                 <filter class="solr.LowerCaseFilterFactory"/>
>                                 <filter 
> class="solr.PatternReplaceFilterFactory" pattern="([^\w\d\*æÆÅ])" 
> replacement="" replace="all"/>
>                                 <filter 
> class="solr.PatternReplaceFilterFactory" pattern="^(.{20})(.*)?" 
> replacement="$1" replace="all"/>
>                         </analyzer>
>                 </fieldType>
> In the Solr Admin Analysis screen, with index value "lucene" and query value 
> "luc", it shows this:
> LENGTF text             luc         luce            lucen               lucene
>        raw_bytes        [6c 75 63]  [6c 75 63 65]   [6c 75 63 65 6e]    [6c 75 63 65 6e 65]
>        start            0           0               0                   0
>        end              6           6               6                   6   
>        positionLength   1           1               1                   1    
>        type             word        word            word                word
>        position         1           1               1                   1    
> Since the end position is 6 in this case, the whole word ("lucene") is 
> highlighted.
>       
> If I change to use NGramFilterFactory it shows me this (for the first three 
> items):
> LENGTF text             luc         uce             cen
>        raw_bytes        [6c 75 63]  [75 63 65]      [63 65 6e]
>        start            0           1               2
>        end              3           4               5
>        positionLength   1           1               1
>        type             word        word            word
>        position         1           1               1
> The end position is then correct and the highlighter highlights only the 
> search term. Note that I have specified luceneMatchVersion="4.3". Without 
> this, the end position goes back to 6 for NGramFilterFactory as well.
>       
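
The effect described in the issue can be illustrated with a small simulation. 
This is only a sketch in Python, not Solr or Lucene code: the functions 
edge_ngrams and highlight are hypothetical names made up for the example, and 
the assumption is simply that a highlighter wraps the text between a matching 
token's start and end offsets. With whole-word offsets (0 to 6, as in the 
first analysis output) the entire word is wrapped; with per-gram offsets only 
the matched prefix is.

```python
def edge_ngrams(word, min_gram=3, max_gram=20, per_gram_offsets=False):
    """Return (text, start, end) tuples for the leading-edge grams of `word`.

    per_gram_offsets=False mimics the reported output: every gram keeps the
    offsets of the original token. per_gram_offsets=True gives each gram an
    end offset equal to its own length.
    """
    tokens = []
    for size in range(min_gram, min(max_gram, len(word)) + 1):
        end = size if per_gram_offsets else len(word)
        tokens.append((word[:size], 0, end))
    return tokens


def highlight(text, tokens, query, pre="<em>", post="</em>"):
    """Wrap the offset range of the first token whose text equals `query`."""
    for gram, start, end in tokens:
        if gram == query:
            return text[:start] + pre + text[start:end] + post + text[end:]
    return text  # no matching token: leave the text unchanged


whole_word = edge_ngrams("lucene")
per_gram = edge_ngrams("lucene", per_gram_offsets=True)

print(highlight("lucene", whole_word, "luc"))  # <em>lucene</em>
print(highlight("lucene", per_gram, "luc"))    # <em>luc</em>ene
```

Which offsets a real filter emits depends on the filter and on 
luceneMatchVersion, as the description above notes; the sketch only shows why 
an end offset of 6 leads to the whole word being highlighted.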



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
