[ 
https://issues.apache.org/jira/browse/SOLR-7926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14697473#comment-14697473
 ] 

Michael McCandless commented on SOLR-7926:
------------------------------------------

I agree this (highlighting of hits from tokens produced by
EdgeNGramFilter) got worse with LUCENE-3907, but it's not clear how to
fix it.

The stacking seems more correct: all these grams are logically
interchangeable with the original token, and were derived from it, so
e.g. a phrase query involving them with adjacent tokens would work
correctly.
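
To make the offset side of this concrete, here is a minimal sketch in plain Python (not Lucene code). The Token class and the edge_ngrams and highlight helpers are hypothetical names invented for illustration. The default mode mirrors the analysis output quoted in the report below, where every gram carries the offsets of the whole word; the per_gram_offsets mode is an assumed alternative in which each gram ends where its own characters end.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    text: str
    position: int       # all grams of one word share a position (stacked)
    start_offset: int   # character offsets into the original field value
    end_offset: int


def edge_ngrams(word, position, start, min_gram, max_gram, per_gram_offsets=False):
    """Model the front edge n-grams of one word (illustrative only).

    per_gram_offsets=False mirrors the analysis output in the report: every
    gram inherits the offsets of the whole word. True is the hypothetical
    alternative where each gram ends where its own characters end.
    """
    word_end = start + len(word)
    tokens = []
    for n in range(min_gram, min(max_gram, len(word)) + 1):
        end = start + n if per_gram_offsets else word_end
        tokens.append(Token(word[:n], position, start, end))
    return tokens


def highlight(text, tokens, term):
    """Toy highlighter: wrap the offsets of the first token equal to term."""
    for tok in tokens:
        if tok.text == term:
            s, e = tok.start_offset, tok.end_offset
            return text[:s] + "<em>" + text[s:e] + "</em>" + text[e:]
    return text


stacked = edge_ngrams("lucene", position=1, start=0, min_gram=3, max_gram=20)
print(highlight("lucene", stacked, "luc"))   # <em>lucene</em>

narrow = edge_ngrams("lucene", 1, 0, 3, 20, per_gram_offsets=True)
print(highlight("lucene", narrow, "luc"))    # <em>luc</em>ene
```

In both modes the grams stay stacked at one position, so a phrase query would behave the same; only the offsets handed to the highlighter differ.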

We could perhaps remove the token graph requirement that all tokens
leaving the same node have the same startOffset, and all tokens arriving
at the same node have the same endOffset.  Lucene would still be able to
index such a graph, as long as all tokens leaving a given node are
sorted according to their startOffset.  But I'm not sure if there
would be other problems...
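
The two rules can be written down as a rough sketch in plain Python. The Arc tuple and both function names are hypothetical, and the "relaxed" check is only one reading of the sorted-by-startOffset idea above, not an existing Lucene check.

```python
from collections import defaultdict
from typing import NamedTuple


class Arc(NamedTuple):
    """One token seen as an arc in the token graph (illustrative model)."""
    text: str
    from_node: int     # node the token leaves from
    to_node: int       # node the token arrives at
    start_offset: int
    end_offset: int


def satisfies_strict_rule(arcs):
    """Same startOffset for all arcs leaving a node, same endOffset for all arriving."""
    starts, ends = defaultdict(set), defaultdict(set)
    for a in arcs:
        starts[a.from_node].add(a.start_offset)
        ends[a.to_node].add(a.end_offset)
    return (all(len(s) == 1 for s in starts.values())
            and all(len(e) == 1 for e in ends.values()))


def satisfies_relaxed_rule(arcs):
    """Assumed relaxation: arcs leaving a node appear in non-decreasing startOffset order."""
    last_start = {}
    for a in arcs:
        if a.start_offset < last_start.get(a.from_node, a.start_offset):
            return False
        last_start[a.from_node] = a.start_offset
    return True


# Edge grams carrying the whole word's offsets: passes the strict rule.
stacked = [Arc(t, 0, 1, 0, 6) for t in ("luc", "luce", "lucen", "lucene")]
# Grams with their own offsets, as in the NGram table of the report.
ngrams = [Arc("luc", 0, 1, 0, 3), Arc("uce", 0, 1, 1, 4), Arc("cen", 0, 1, 2, 5)]

print(satisfies_strict_rule(stacked), satisfies_relaxed_rule(stacked))
print(satisfies_strict_rule(ngrams), satisfies_relaxed_rule(ngrams))
```

Under this model the per-gram offsets that would give precise highlighting are exactly what the strict rule rejects, while the relaxed rule accepts them as long as the stream order follows startOffset.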

Or we could maybe improve the token graph, at least for the non-edge
NGramTokenFilter, so that the grams are linked up correctly, so that any
path through the graph reconstructs the original characters.
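
One possible reading of "linked up correctly", sketched in plain Python under the assumption that graph nodes are character offsets and each gram is an arc from its start offset to its end offset. The gram_arcs and full_paths helpers are hypothetical names, not Lucene API.

```python
def gram_arcs(word, min_gram, max_gram):
    """Every gram becomes an arc (from_node, to_node, text); nodes are character offsets."""
    arcs = []
    for start in range(len(word)):
        for n in range(min_gram, max_gram + 1):
            if start + n <= len(word):
                arcs.append((start, start + n, word[start:start + n]))
    return arcs


def full_paths(word, arcs):
    """Enumerate all arc sequences leading from node 0 to node len(word)."""
    by_start = {}
    for frm, to, text in arcs:
        by_start.setdefault(frm, []).append((to, text))

    def walk(node):
        if node == len(word):
            yield []
            return
        for to, text in by_start.get(node, []):
            for rest in walk(to):
                yield [text] + rest

    return list(walk(0))


word = "lucene"
for path in full_paths(word, gram_arcs(word, 2, 3)):
    # Each complete path concatenates back to the original characters.
    assert "".join(path) == word
    print(path)
```

Note that in this model some grams (for example "uce" with a minimum gram size of 3) sit on no complete path from the first to the last node, so a real design would still have to decide how such grams are handled.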

But realistically it's not possible to innovate much with token graphs
in Lucene today because of apparently severe back compat requirements:
e.g. LUCENE-6664, which fixes the token graph bugs in the existing
SynonymFilter so that proximity queries work correctly when using
search-time synonyms, is blocked because of the back compat concerns
from LUCENE-6721.

I'm not sure what the path forward is...


> Hit highlighting with EdgeNGramFilterFactory
> --------------------------------------------
>
>                 Key: SOLR-7926
>                 URL: https://issues.apache.org/jira/browse/SOLR-7926
>             Project: Solr
>          Issue Type: Bug
>          Components: highlighter
>    Affects Versions: 5.1, 5.2.1
>         Environment: CentOS 7 (5.2.1), OS X 10.10.5 (5.1)
>            Reporter: Bjørn Hjelle
>            Priority: Critical
>              Labels: EdgeNGramTokenFilter, highlighting
>
> Hit highlighting marks the whole word, not just the part that matches the 
> search term, when EdgeNGramFilterFactory is used in the field type.
> In schema.xml I have the field type text_ngram:
>                 <fieldType name="text_ngram" class="solr.TextField">
>                         <analyzer type="index">
>                                 <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
>                                 <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>                                 <!--tokenizer class="solr.StandardTokenizerFactory"/-->
>                                 <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
>                                 <filter class="solr.LowerCaseFilterFactory"/>
>                                 <filter class="solr.EdgeNGramFilterFactory" maxGramSize="20" minGramSize="3" luceneMatchVersion="4.3"/>
>                                 <filter class="solr.PatternReplaceFilterFactory" pattern="([^\w\d\*æ?~F?~E])" replacement="" replace="all"/>
>                         </analyzer>
>                         <analyzer type="query">
>                                 <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
>                                 <tokenizer class="solr.StandardTokenizerFactory"/>
>                                 <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
>                                 <filter class="solr.LowerCaseFilterFactory"/>
>                                 <filter class="solr.PatternReplaceFilterFactory" pattern="([^\w\d\*æ?~F?~E])" replacement="" replace="all"/>
>                                 <filter class="solr.PatternReplaceFilterFactory" pattern="^(.{20})(.*)?" replacement="$1" replace="all"/>
>                         </analyzer>
>                 </fieldType>
> On the Solr Admin Analysis screen, with index value "lucene" and query value 
> "luc", it shows this: 
> LENGTF text             luc         luce            lucen               lucene
>        raw_bytes        [6c 75 63]  [6c 75 63 65]   [6c 75 63 65 6e]    [6c 75 63 65 6e 65]
>        start            0           0               0                   0
>        end              6           6               6                   6
>        positionLength   1           1               1                   1
>        type             word        word            word                word
>        position         1           1               1                   1
> Since the end offset is 6 in this case, the whole word ("lucene") is 
> highlighted. 
>       
> If I change to use NGramFilterFactory it shows me this (for the first three 
> items):
> LENGTF text             luc         uce             cen
>        raw_bytes        [6c 75 63]  [75 63 65]      [63 65 6e]
>        start            0           1               2
>        end              3           4               5
>        positionLength   1           1               1
>        type             word        word            word
>        position         1           1               1
> The end offset is correct then, and the highlighter highlights only the 
> search term. Note that I have specified luceneMatchVersion="4.3". Without 
> this, the end offsets go back to 6 for the NGramFilterFactory as well. 
>       



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
