[
https://issues.apache.org/jira/browse/SOLR-5332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14343414#comment-14343414
]
Simon Endele commented on SOLR-5332:
------------------------------------
+1 for this feature.
We use the EdgeNGramFilterFactory on a tokenized field (to implement a
"prefix search" at index time) with minGramSize="3".
Unfortunately, we observed that tokens of length 1 or 2 are dropped entirely,
which we did not expect.
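To illustrate the behaviour we observed, here is a minimal Python sketch. It is
a simplified model, not the Lucene implementation; the function name
edge_ngrams and the max gram size of 6 are assumptions chosen for the example.

```python
def edge_ngrams(token, min_gram, max_gram):
    """Return the leading-edge n-grams of token with lengths min_gram..max_gram.

    Simplified model: a token shorter than min_gram yields no grams at all.
    """
    longest = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, longest + 1)]


# max_gram=6 is an assumed value; only minGramSize="3" is given above.
for token in ["us", "representative"]:
    print(token, "->", edge_ngrams(token, min_gram=3, max_gram=6))
```

In this model "us" produces an empty list, so nothing is indexed for it, while
"representative" produces its prefixes starting with "rep".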
Using a second field without the filter (though complicated IMHO) would address
the query issues, but it gets awkward when it comes to highlighting or phrase
searches.
For instance, when searching for "us rep":
- the field with EdgeNGramFilterFactory highlights "rep" in "representative",
but not "US" as this token has been removed,
- the field without EdgeNGramFilterFactory highlights "US", but not
"representative" as it has no prefixes indexed.
Merging these highlights into one string is quite a complex task, not to
mention phrase search, which does not work at all for the example above.
We use minGramSize="3" to reduce collisions of prefixes and abbreviations (like
"US" and "usage") and reduce the index size.
I admit, this does not prevent all collisions (e.g. "USA" still collides with
"usage"), but it's a compromise.
In short, minGramSize is a nice feature of EdgeNGramFilterFactory, but the
filter lacks a "preserveOriginal" flag, IMO.
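A rough Python sketch of what such a flag could do, under the assumption that
it adds the unmodified token whenever its length falls outside the allowed
gram sizes. The function name, the preserve_original parameter and the max
gram size of 6 are hypothetical and only serve the illustration.

```python
def edge_ngrams_preserving(token, min_gram, max_gram, preserve_original=False):
    """Edge n-grams of token, optionally keeping tokens that fit no gram size.

    Hypothetical model of a "preserveOriginal" flag, not an existing API.
    """
    longest = min(len(token), max_gram)
    grams = [token[:n] for n in range(min_gram, longest + 1)]
    # Keep the original token if it is too short or too long to be a gram.
    if preserve_original and not (min_gram <= len(token) <= max_gram):
        grams.append(token)
    return grams


print(edge_ngrams_preserving("us", 3, 6))                          # dropped
print(edge_ngrams_preserving("us", 3, 6, preserve_original=True))  # kept
# The remaining collision: "usa" is also an indexed prefix of "usage".
print("usa" in edge_ngrams_preserving("usage", 3, 6))
```

With the flag set, "us" stays in the index as a whole token and still does not
collide with "usage", because no 2-character prefixes are generated.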
> Add "preserve original" setting to the EdgeNGramFilterFactory
> -------------------------------------------------------------
>
> Key: SOLR-5332
> URL: https://issues.apache.org/jira/browse/SOLR-5332
> Project: Solr
> Issue Type: Wish
> Affects Versions: 4.4, 4.5, 4.5.1, 4.6
> Reporter: Alexander S.
>
> Hi, as described here:
> http://lucene.472066.n3.nabble.com/Help-to-figure-out-why-query-does-not-match-td4086967.html
> the problem is that if you have these 2 strings to index:
> 1. facebook.com/someuser.1
> 2. facebook.com/someveryandverylongusername
> and configure the edge n-gram filter factory with min and max gram sizes of
> 2 and 25, search requests for these URLs will fail.
> But search requests for:
> 1. facebook.com/someuser
> 2. facebook.com/someveryandverylonguserna
> will work properly.
> That's because the first URL has "1" at the end, which is shorter than the
> allowed min gram size. In the second URL the user name (27 characters) is
> longer than the max gram size.
> It would be good to have a "preserve original" option that adds the
> original string to the index if it does not fit the allowed gram sizes, so
> that the "1" and "someveryandverylongusername" tokens are also added to the
> index.
> Best,
> Alex
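The failing and working searches from the issue description can be reproduced
with a small Python model. It assumes that the tokenizer yields "1" and the
user names as separate tokens (as the description suggests) and that the query
token is compared as-is against the indexed grams; edge_ngrams and matches are
illustrative names, not Solr APIs.

```python
def edge_ngrams(token, min_gram, max_gram):
    """Leading-edge n-grams of token with lengths min_gram..max_gram."""
    longest = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, longest + 1)]


def matches(query_token, indexed_token, min_gram=2, max_gram=25):
    """True if the unmodified query token equals one of the indexed grams."""
    return query_token in edge_ngrams(indexed_token, min_gram, max_gram)


long_name = "someveryandverylongusername"          # 27 characters
print(matches("1", "1"))                           # below min gram size
print(matches(long_name, long_name))               # above max gram size
print(matches("someveryandverylonguserna", long_name))  # 25-char prefix
print(matches("someuser", "someuser"))
```

The first two lookups find nothing because neither "1" nor the full
27-character name is ever emitted as a gram; the last two succeed.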