Re: Folding Repeated Letters

Walter Underwood Fri, 09 Oct 2020 09:30:36 -0700

Actually, helping the humans to use proper spelling is a good approach. Include a
spelling correction step (non-optional) for user-generated content and spelling
suggestions for queries. Completion/suggestion is another way to guide people
to properly spelled words that exist in your index.


I agree that trying to fix this after you have the query is hard.

If edismax supported fuzzy matching, it would be much easier. I know that, because
we’ve been running that patch (SOLR-629) in prod for several years.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Oct 9, 2020, at 4:27 AM, Erick Erickson <erickerick...@gmail.com> wrote:
> 
> Anything you do will be wrong ;).
> 
> I suppose you could kick out words that weren’t in some dictionary and
> accumulate a list of words not in the dictionary and just deal with them
> “somehow”, but that’s labor-intensive since you then have to deal with proper
> names and the like. Sometimes you can get by with ignoring words with _only_
> the first letter capitalized, which is also not perfect but might get you
> closer. You mentioned phonetic filters, but frankly I have no idea whether
> YES and YYYYYYEEEEEEEESSSSSSSS would reduce to the same code; I rather doubt
> it.
> 
> In general, you _can’t_ solve this problem perfectly without inspecting each
> input; you can only get an approximation. And at some point it’s worth asking
> “is it worth it?”. I suppose you could try the regex Andy suggested in a
> copyField destination and use that as well as the primary field in queries;
> that might help at least find things like this.
> 
> If we were just able to require humans to use proper spelling, this would be 
> a lot easier….
> 
> Wish there were a solution
> 
> Best,
> Erick
> 
>> On Oct 8, 2020, at 10:59 PM, Mike Drob <md...@mdrob.com> wrote:
>> 
>> I was thinking about that, but there are words that are legitimately
>> different with repeated consonants. My primary school teacher lost hair
>> over getting us to learn the difference between desert and dessert.
>> 
>> Maybe we need something that can borrow the boosting behaviour of fuzzy
>> query - match the exact term, but also the neighbors with a slight deboost,
>> so that if the main term exists those others won't show up.
>> 
>> On Thu, Oct 8, 2020 at 5:46 PM Andy Webb <andywebb1...@gmail.com> wrote:
>> 
>>> How about something like this?
>>> 
>>> {
>>>   "add-field-type": [
>>>       {
>>>           "name": "norepeat",
>>>           "class": "solr.TextField",
>>>           "analyzer": {
>>>               "tokenizer": {
>>>                   "class": "solr.StandardTokenizerFactory"
>>>               },
>>>               "filters": [
>>>                   {
>>>                       "class": "solr.LowerCaseFilterFactory"
>>>                   },
>>>                   {
>>>                       "class": "solr.PatternReplaceFilterFactory",
>>>                       "pattern": "(.)\\1+",
>>>                       "replacement": "$1"
>>>                   }
>>>               ]
>>>           }
>>>       }
>>>   ]
>>> }
>>> 
>>> This finds a match...
>>> 
>>> http://localhost:8983/solr/#/norepeat/analysis?analysis.fieldvalue=Yes&analysis.query=yyyyYyyyyyyeeEssSsssss&analysis.fieldtype=norepeat
>>> 
>>> Andy
>>> 
>>> 
>>> 
>>> On Thu, 8 Oct 2020 at 23:02, Mike Drob <md...@mdrob.com> wrote:
>>> 
>>>> I'm looking for a way to transform words with repeated letters into the
>>>> same token - does something like this exist out of the box? Do our
>>>> stemmers support it?
>>>> 
>>>> For example, say I would want all of these terms to return the same
>>>> search results:
>>>> 
>>>> YES
>>>> YESSS
>>>> YYYEEESSS
>>>> YYEESSSS[...]S
>>>> 
>>>> I don't know how long a user would hold down the S key at the end to
>>>> capture their level of excitement, and I don't want to manually define
>>>> synonyms for every length.
>>>> 
>>>> I'm pretty sure that I don't want PhoneticFilter here, maybe
>>>> PatternReplace? Not a huge fan of how that one is configured, and I think
>>>> I'd have to set up a bunch of patterns inline for it?
>>>> 
>>>> Mike
>>>> 
>>> 
>
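
For anyone who wants to try the pattern quoted above without a running Solr, here is a minimal sketch in Python. It assumes that Python's re module treats (.)\1+ the same way the Java regex engine does inside PatternReplaceFilterFactory, and that the filter replaces every match rather than only the first. The helper name fold_repeats is made up for illustration and is not part of Solr.

```python
import re

# Same regex as the "pattern" value in the norepeat field type:
# any single character followed by one or more copies of itself.
REPEAT = re.compile(r"(.)\1+")


def fold_repeats(token: str) -> str:
    """Lowercase the token, then collapse each run of a repeated character to one."""
    return REPEAT.sub(r"\1", token.lower())


if __name__ == "__main__":
    samples = ["YES", "YESSS", "YYYEEESSS", "yyyyYyyyyyyeeEssSsssss",
               "desert", "dessert"]
    for word in samples:
        print(word, "->", fold_repeats(word))
```

Under those assumptions, every YES variant folds to the same token, but so do "desert" and "dessert", which is the collision Mike describes. That is one reason to keep the folded form in a separate copyField destination and query it alongside the primary field, as Erick suggests, rather than replacing the primary field.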
