[jira] [Commented] (SOLR-5332) Add "preserve original" setting to the EdgeNGramFilterFactory

Simon Endele (JIRA) Mon, 02 Mar 2015 09:26:36 -0800

    [ 
https://issues.apache.org/jira/browse/SOLR-5332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14343414#comment-14343414
 ]


Simon Endele commented on SOLR-5332:
------------------------------------

+1 for this feature.
We use the EdgeNGramFilterFactory on a tokenized field (to implement a 
"prefix search" at index time) with minGramSize="3".
Unfortunately, we observed that tokens of length 1 or 2 are dropped entirely, 
which we did not expect.

Using a second field (though complicated IMHO) would address the query issues, 
but it gets awkward when it comes to highlighting or phrase searches.
For instance, when searching for "us rep":
- the field with EdgeNGramFilterFactory highlights "rep" in "representative", 
but not "US" as this token has been removed,
- the field without EdgeNGramFilterFactory highlights "US", but not 
"representative" as it has no prefixes indexed.

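To make the mismatch concrete, here is a rough sketch in plain Python (not 
Solr or Lucene code) that simulates the two fields. The helper names, the 
lowercasing, and max_gram=15 are assumptions made for illustration only; the 
real analysis chain may differ.

```python
def edge_ngrams(token, min_gram=3, max_gram=15):
    """Simulate an edge n-gram filter: prefixes of length min_gram..max_gram.

    Tokens shorter than min_gram produce no grams at all.
    max_gram=15 is an assumed value, not taken from our configuration.
    """
    upper = min(len(token), max_gram)
    return {token[:n] for n in range(min_gram, upper + 1)}


def matched_terms(query, indexed_terms):
    """Return the query terms that exist as terms in the (simulated) index."""
    return [t for t in query.lower().split() if t in indexed_terms]


# Document text "US representative" after tokenizing and lowercasing (assumed).
doc_tokens = ["us", "representative"]

# Field 1: with the edge n-gram filter, minGramSize=3.
ngram_field = set().union(*(edge_ngrams(t) for t in doc_tokens))
# Field 2: same tokens, no edge n-gram filter.
plain_field = set(doc_tokens)

print(matched_terms("us rep", ngram_field))  # only "rep" can be highlighted
print(matched_terms("us rep", plain_field))  # only "us" can be highlighted
```

Neither simulated field contains both query terms, which is why neither 
field alone can highlight the whole match.
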
Merging these highlights into one string is quite a complex task, not to 
mention a phrase search, which does not work at all for the example above.

We use minGramSize="3" to reduce collisions of prefixes and abbreviations (like 
"US" and "usage") and reduce the index size.
I admit, this does not prevent all collisions (e.g. "USA" still collides with 
"usage"), but it's a compromise.

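The trade-off can be sketched with a small Python simulation (again not Solr 
code). The function names, the lowercasing step, and max_gram=15 are 
illustrative assumptions.

```python
def edge_ngrams(token, min_gram, max_gram=15):
    """Prefixes of `token` with lengths min_gram..max_gram (simulated filter)."""
    upper = min(len(token), max_gram)
    return {token[:n] for n in range(min_gram, upper + 1)}


def collides(query_term, indexed_word, min_gram):
    """True if `query_term` equals a prefix gram indexed for `indexed_word`.

    Both sides are lowercased, assuming a lowercase filter in the chain.
    """
    return query_term.lower() in edge_ngrams(indexed_word.lower(), min_gram)


us_with_min_2 = collides("US", "usage", min_gram=2)    # "us" is a gram of "usage"
us_with_min_3 = collides("US", "usage", min_gram=3)    # "us" is too short to be a gram
usa_with_min_3 = collides("USA", "usage", min_gram=3)  # "usa" is still a gram of "usage"

print(us_with_min_2, us_with_min_3, usa_with_min_3)
```

Raising minGramSize removes the two-letter collision but not the three-letter 
one, at the price of dropping the token "us" itself.
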
So minGramSize is a nice feature of EdgeNGramFilterFactory, but IMO it 
lacks a "preserveOriginal" flag.
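
As a rough illustration of what such a flag could do, here is a Python 
simulation (not a patch against Lucene or Solr). The preserve_original 
parameter and its exact semantics are hypothetical; this sketch keeps the 
original token only when its length falls outside the gram range, following 
the wording of the issue description quoted below.

```python
def edge_ngrams(token, min_gram, max_gram, preserve_original=False):
    """Simulated edge n-gram filter with a hypothetical preserve_original flag."""
    upper = min(len(token), max_gram)
    grams = [token[:n] for n in range(min_gram, upper + 1)]
    # Hypothetical behaviour: also emit the unmodified token when it is
    # shorter than min_gram or longer than max_gram.
    if preserve_original and not (min_gram <= len(token) <= max_gram):
        grams.append(token)
    return grams


# Tokens from the reporter's example, with gram sizes 2 and 25.
short_token = "1"
long_token = "someveryandverylongusername"  # 27 characters

without_flag = edge_ngrams(short_token, 2, 25)
with_flag = edge_ngrams(short_token, 2, 25, preserve_original=True)
long_with_flag = edge_ngrams(long_token, 2, 25, preserve_original=True)

print(without_flag)        # the short token vanishes
print(with_flag)           # the short token survives as-is
print(long_with_flag[-1])  # the full long token is kept as the last term
```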

> Add "preserve original" setting to the EdgeNGramFilterFactory
> -------------------------------------------------------------
>
>                 Key: SOLR-5332
>                 URL: https://issues.apache.org/jira/browse/SOLR-5332
>             Project: Solr
>          Issue Type: Wish
>    Affects Versions: 4.4, 4.5, 4.5.1, 4.6
>            Reporter: Alexander S.
>
> Hi, as described here: 
> http://lucene.472066.n3.nabble.com/Help-to-figure-out-why-query-does-not-match-td4086967.html
>  the problem is that if you have these 2 strings to index:
> 1. facebook.com/someuser.1
> 2. facebook.com/someveryandverylongusername
> and the edge ngram filter factory with min and max gram size settings 2 and 
> 25, search requests for these urls will fail.
> But search requests for:
> 1. facebook.com/someuser
> 2. facebook.com/someveryandverylonguserna
> will work properly.
> It's because the first URL has "1" at the end, which is shorter than the 
> allowed min gram size. In the second URL the user name is longer than the 
> max gram size (27 characters).
> Would be good to have a "preserve original" option, that will add the 
> original string to the index if it does not fit the allowed gram size, so 
> that "1" and "someveryandverylongusername" tokens will also be added to the 
> index.
> Best,
> Alex



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

