[jira] [Commented] (SOLR-7926) Hit highlighting with EdgeNGramFilterFactory

David Smiley (JIRA) Fri, 14 Aug 2015 14:31:15 -0700

    [ 
https://issues.apache.org/jira/browse/SOLR-7926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14697791#comment-14697791
 ]


David Smiley commented on SOLR-7926:
------------------------------------

bq. We could perhaps remove the token graph requirement that tokens leaving 
from the same node have the same startOffset, and arriving to the same node 
have the same endOffset. Lucene would still be able to index such a graph, as 
long as all tokens leaving a given node are sorted according to their 
startOffset. But I'm not sure if there would be other problems...

That direction seems best to me.  But yes, there may be problems we 
don't see yet that only show up once we try.  Not a blocker; just some unknown 
unknowns that will become known :-)
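
To make the quoted proposal concrete, here is a minimal sketch of the two rules in plain Python. It is only a model of the rule as worded above, not Lucene code: the `Token` type and the `strict_ok` / `relaxed_ok` helpers are hypothetical names, and a node is assumed to be identified by `position` (leaving) and `position + position_length` (arriving).

```python
from collections import defaultdict
from typing import NamedTuple


class Token(NamedTuple):
    text: str
    position: int         # node the token leaves from (assumption)
    position_length: int  # token arrives at position + position_length
    start: int            # startOffset
    end: int              # endOffset


def strict_ok(tokens):
    """Current rule: same leaving node => same startOffset,
    same arriving node => same endOffset."""
    starts, ends = defaultdict(set), defaultdict(set)
    for t in tokens:
        starts[t.position].add(t.start)
        ends[t.position + t.position_length].add(t.end)
    return (all(len(s) == 1 for s in starts.values())
            and all(len(e) == 1 for e in ends.values()))


def relaxed_ok(tokens):
    """Proposed rule: tokens leaving a node only need to come in
    non-decreasing startOffset order."""
    last_start = {}
    for t in tokens:
        if t.position in last_start and t.start < last_start[t.position]:
            return False
        last_start[t.position] = t.start
    return True


# Edge n-grams of "lucene" that all keep the offsets of the whole word.
whole_word = [Token(g, 1, 1, 0, 6) for g in ("luc", "luce", "lucen", "lucene")]
# Edge n-grams with an end offset per gram.
per_gram = [Token("luc", 1, 1, 0, 3), Token("luce", 1, 1, 0, 4),
            Token("lucen", 1, 1, 0, 5), Token("lucene", 1, 1, 0, 6)]
# Plain n-grams as listed in the issue description (first three items).
ngrams = [Token("luc", 1, 1, 0, 3), Token("uce", 1, 1, 1, 4), Token("cen", 1, 1, 2, 5)]
```

Under these assumptions only `whole_word` passes `strict_ok`, while all three lists pass `relaxed_ok`; a list whose start offsets go backwards at one node would fail both.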

> Hit highlighting with EdgeNGramFilterFactory
> --------------------------------------------
>
>                 Key: SOLR-7926
>                 URL: https://issues.apache.org/jira/browse/SOLR-7926
>             Project: Solr
>          Issue Type: Bug
>          Components: highlighter
>    Affects Versions: 5.1, 5.2.1
>         Environment: CentOS 7 (5.2.1), OS X 10.10.5 (5.1)
>            Reporter: Bjørn Hjelle
>            Priority: Critical
>              Labels: EdgeNGramTokenFilter, highlighting
>
> Hit highlighting marks the whole word, not just the part that matches the 
> search term, when EdgeNGramFilterFactory is used in the field type.
> In schema.xml I have field type text_ngram:
> <fieldType name="text_ngram" class="solr.TextField">
>   <analyzer type="index">
>     <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <!--tokenizer class="solr.StandardTokenizerFactory"/-->
>     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"
>             catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.EdgeNGramFilterFactory" maxGramSize="20" minGramSize="3"
>             luceneMatchVersion="4.3"/>
>     <filter class="solr.PatternReplaceFilterFactory" pattern="([^\w\d\*æ?~F?~E])"
>             replacement="" replace="all"/>
>   </analyzer>
>   <analyzer type="query">
>     <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0"
>             catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.PatternReplaceFilterFactory" pattern="([^\w\d\*æ?~F?~E])"
>             replacement="" replace="all"/>
>     <filter class="solr.PatternReplaceFilterFactory" pattern="^(.{20})(.*)?"
>             replacement="$1" replace="all"/>
>   </analyzer>
> </fieldType>
> In the Solr Admin Analysis screen, with index value "lucene" and query value 
> "luc", it shows this: 
> LENGTF text             luc         luce            lucen               lucene
>        raw_bytes        [6c 75 63]  [6c 75 63 65]   [6c 75 63 65 6e]    [6c 75 63 65 6e 65]
>        start            0           0               0                   0
>        end              6           6               6                   6
>        positionLength   1           1               1                   1
>        type             word        word            word                word
>        position         1           1               1                   1
> Since the end offset is 6 for every gram, the whole word ("lucene") is 
> highlighted. 
>       
> If I change to use NGramFilterFactory it shows me this (for the first three 
> items):
> LENGTF text             luc         uce         cen
>        raw_bytes        [6c 75 63]  [75 63 65]  [63 65 6e]
>        start            0           1           2
>        end              3           4           5
>        positionLength   1           1           1
>        type             word        word        word
>        position         1           1           1
> The end offset is correct then, and the highlighter highlights only the 
> search term. Note that I have specified luceneMatchVersion="4.3". Without 
> this, the end offsets go back to 6 for NGramFilterFactory as well. 
>       
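
The behaviour described in the issue can be reproduced with a small simulation. This is a hedged sketch in plain Python, not Solr or Lucene code: `edge_ngrams` and `highlight` are hypothetical helpers, and the `keep_token_offsets` flag is an assumption that models the two offset behaviours shown in the analysis output (every gram carrying the offsets of the whole token versus each gram carrying its own end offset).

```python
def edge_ngrams(word, min_gram, max_gram, start=0, keep_token_offsets=True):
    """Return (text, start_offset, end_offset) for each leading gram of `word`."""
    grams = []
    for size in range(min_gram, min(max_gram, len(word)) + 1):
        # Either reuse the end offset of the original token or use the gram's own.
        end = start + (len(word) if keep_token_offsets else size)
        grams.append((word[:size], start, end))
    return grams


def highlight(text, grams, query, pre="<em>", post="</em>"):
    """Wrap the span given by the offsets of the first gram equal to `query`."""
    for gram, start, end in grams:
        if gram == query:
            return text[:start] + pre + text[start:end] + post + text[end:]
    return text


whole = highlight("lucene", edge_ngrams("lucene", 3, 20), "luc")
part = highlight("lucene", edge_ngrams("lucene", 3, 20, keep_token_offsets=False), "luc")
print(whole)  # <em>lucene</em>
print(part)   # <em>luc</em>ene
```

The highlighter in this sketch only sees offsets, so when every gram reports 0 to 6 it has no way to mark just "luc".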



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
