I ended up with this:

    <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="6" side="front"/>

and it works great! It's important to specify side, or the N-gram buildout is really huge. My users generally type their wildcard searches left-anchored, so generating every possible gram was not only overkill but was also causing far too many false positives.

To provide some on-the-fly documentation of the above, if you have:

    sm333k carbon shoes

the tokens generated, given my specs above, are:

    sm3 sm33 sm333 sm333k
    car carb carbo carbon
    sho shoe shoes

For a word with 7+ characters, it makes the 4 N-grams of length 3 to 6, starting with the 1st char. It's like:

    for (i=3..6) { token=substr(x, 0, i); }

Thanks for pointing me in this direction!

On Thu, Nov 15, 2012 at 4:59 PM, Upayavira <u...@odoko.co.uk> wrote:
> Remember to distinguish between recall and precision - you're likely to
> get too many results, but what matters is whether the first ones are
> useful.
>
> You could have two versions of your field, one with normal stemming,
> another with n-grams, and boost the normal field above the n-gram one,
> to give exact matches a boost above inexact matches.
>
> Upayavira
>
> On Thu, Nov 15, 2012, at 09:48 PM, David Alyea wrote:
> > OK, I tried that. Had just Snowball and EdgeNGram
> > in both index and query. When I ran the "sm3 carbon"
> > select, it went from 3,500 matches to 89,000! So yes,
> > that edge building works! But too much. And... the
> > top score matches didn't look at all like "sm3 carbon"
> > products, and the shoes were nowhere in sight. So,
> > I'll toy with it on a dev instance and see what I see.
> > I definitely like the idea and I can see that N-gram
> > tokens are going to behave like wildcarding.
> >
> > On Thu, Nov 15, 2012 at 4:13 PM, Robert Muir <rcm...@gmail.com> wrote:
> >
> > > On Thu, Nov 15, 2012 at 9:44 AM, David Alyea <dal...@gmail.com> wrote:
> > >
> > > > to index:
> > > > <filter class="solr.PorterStemFilterFactory"/>
> > > > <filter class="solr.KStemFilterFactory"/>
> > > > <filter class="solr.EnglishMinimalStemFilterFactory"/>
> > > >
> > > > to query:
> > > > <filter class="solr.SnowballPorterFilterFactory" language="English"/>
> > > >
> > >
> > > I don't think it's a good idea to use 4 different stemming algorithms
> > > (porter1, kstem, plural at index-time) and porter2 at query-time.
> > > This means you are analyzing terms in a totally different way at index
> > > time than you are at query-time.
> > >
> > > Just pick one of them: make your index-time and query-time analysis
> > > the same as a start and I think you will see fewer surprises.
> > >
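
P.S. For anyone who wants to check the token list at the top of this message without a Solr instance, here is a minimal Python sketch of front-anchored edge N-grams with minGramSize=3 and maxGramSize=6. It is only an illustration of the idea, not Solr's implementation: the function names edge_ngrams and analyze are made up for this example, a plain whitespace split stands in for whatever tokenizer the field type really uses, and no stemming or lowercasing is applied.

```python
def edge_ngrams(token, min_gram=3, max_gram=6):
    """Return the front-anchored prefixes of token, lengths min_gram..max_gram.

    Hypothetical helper for illustration; it mirrors the pseudocode
    for (i=3..6) { token=substr(x, 0, i); } but stops at the token's length.
    """
    upper = min(max_gram, len(token))
    # If the token is shorter than min_gram the range is empty: no grams.
    return [token[:n] for n in range(min_gram, upper + 1)]


def analyze(text, min_gram=3, max_gram=6):
    """Split on whitespace (an assumption) and collect edge N-grams per token."""
    grams = []
    for token in text.split():
        grams.extend(edge_ngrams(token, min_gram, max_gram))
    return grams


if __name__ == "__main__":
    print(" ".join(analyze("sm333k carbon shoes")))
```

Run against "sm333k carbon shoes" this prints the same eleven tokens listed above. In this sketch a token shorter than min_gram produces nothing at all, which is worth keeping in mind for one- and two-letter query terms.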