Actually, helping the humans to use proper spelling is a good approach. Include a spelling correction step (non-optional) for user-generated content and spelling suggestions for queries. Completion/suggestion is another way to guide people to properly spelled words that exist in your index.
I agree that trying to fix this after you have the query is hard. If edismax supported fuzzy matching, it would be much easier. I know that, because we’ve been running that patch (SOLR-629) in prod for several years.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)

> On Oct 9, 2020, at 4:27 AM, Erick Erickson <erickerick...@gmail.com> wrote:
>
> Anything you do will be wrong ;).
>
> I suppose you could kick out words that weren’t in some dictionary and
> accumulate a list of words not in the dictionary and just deal with them
> “somehow”, but that’s labor-intensive since you then have to deal with proper
> names and the like. Sometimes you can get by with ignoring words with _only_
> the first letter capitalized, which is also not perfect but might get you
> closer. You mentioned phonetic filters, but frankly I have no idea whether
> YES and YYYYYYEEEEEEEESSSSSSSS would reduce to the same code, I rather doubt
> it.
>
> In general, you _can’t_ solve this problem perfectly without inspecting each
> input, you can only get an approximation. And at some point it’s worth asking
> “is it worth it?”. I suppose you could try the regex Andy suggested in a
> copyField destination and use that as well as the primary field in queries,
> that might help at least find things like this.
>
> If we were just able to require humans to use proper spelling, this would be
> a lot easier….
>
> Wish there were a solution
>
> Best,
> Erick
>
>> On Oct 8, 2020, at 10:59 PM, Mike Drob <md...@mdrob.com> wrote:
>>
>> I was thinking about that, but there are words that are legitimately
>> different with repeated consonants. My primary school teacher lost hair
>> over getting us to learn the difference between desert and dessert.
>>
>> Maybe we need something that can borrow the boosting behaviour of fuzzy
>> query - match the exact term, but also the neighbors with a slight deboost,
>> so that if the main term exists those others won't show up.
>>
>> On Thu, Oct 8, 2020 at 5:46 PM Andy Webb <andywebb1...@gmail.com> wrote:
>>
>>> How about something like this?
>>>
>>> {
>>>   "add-field-type": [
>>>     {
>>>       "name": "norepeat",
>>>       "class": "solr.TextField",
>>>       "analyzer": {
>>>         "tokenizer": {
>>>           "class": "solr.StandardTokenizerFactory"
>>>         },
>>>         "filters": [
>>>           {
>>>             "class": "solr.LowerCaseFilterFactory"
>>>           },
>>>           {
>>>             "class": "solr.PatternReplaceFilterFactory",
>>>             "pattern": "(.)\\1+",
>>>             "replacement": "$1"
>>>           }
>>>         ]
>>>       }
>>>     }
>>>   ]
>>> }
>>>
>>> This finds a match...
>>>
>>> http://localhost:8983/solr/#/norepeat/analysis?analysis.fieldvalue=Yes&analysis.query=yyyyYyyyyyyeeEssSsssss&analysis.fieldtype=norepeat
>>>
>>> Andy
>>>
>>>
>>>
>>> On Thu, 8 Oct 2020 at 23:02, Mike Drob <md...@mdrob.com> wrote:
>>>
>>>> I'm looking for a way to transform words with repeated letters into the
>>>> same token - does something like this exist out of the box? Do our stemmers
>>>> support it?
>>>>
>>>> For example, say I would want all of these terms to return the same search
>>>> results:
>>>>
>>>> YES
>>>> YESSS
>>>> YYYEEESSS
>>>> YYEESSSS[...]S
>>>>
>>>> I don't know how long a user would hold down the S key at the end to
>>>> capture their level of excitement, and I don't want to manually define
>>>> synonyms for every length.
>>>>
>>>> I'm pretty sure that I don't want PhoneticFilter here, maybe
>>>> PatternReplace? Not a huge fan of how that one is configured, and I think
>>>> I'd have to set up a bunch of patterns inline for it?
>>>>
>>>> Mike
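
To see what the (.)\1+ pattern from the quoted norepeat field type does to the example terms, here is a minimal Python sketch. It is not Solr code: collapse_repeats is a hypothetical helper that imitates a lowercase step followed by the pattern replacement on a single token, on the assumption that Python's re module handles this particular pattern the same way Java's regex engine does. It also reproduces the desert/dessert collision raised earlier in the thread.

```python
import re

# Same pattern as in the quoted field type: any character followed by
# one or more copies of itself (the JSON form escapes the backslash).
REPEAT = re.compile(r"(.)\1+")


def collapse_repeats(token: str) -> str:
    """Hypothetical stand-in for lowercase + pattern-replace on one token."""
    return REPEAT.sub(r"\1", token.lower())


if __name__ == "__main__":
    for word in ["YES", "YESSS", "YYYEEESSS", "desert", "dessert"]:
        print(word, "->", collapse_repeats(word))
    # YES -> yes
    # YESSS -> yes
    # YYYEEESSS -> yes
    # desert -> desert
    # dessert -> desert
```

The last two lines are the trade-off: every legitimate double letter is collapsed too, so this kind of field probably works best as a secondary copyField target queried alongside the primary field, as suggested above.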