Cool idea. I would also be inclined to limit it to searches containing 4 or fewer words/tokens.
My only experience is with Soundex, which was invented in 1918, so I'm probably not the one to ask. :P

Kevin Smith
Agile Coach, Wikimedia Foundation

On Thu, Jan 14, 2016 at 1:53 PM, Trey Jones <[email protected]> wrote:

> For some reason today I wanted to look up Mikhail Baryshnikov. It's been a
> while so I forgot how to spell his last name. I didn't try very hard, and I
> got no enwiki result. Google, of course, found the correct spelling, which
> I then used on enwiki.
>
> Since I used to do name searching and matching, this gave me an idea,
> which generalizes beyond just names.
>
> For every article title (and maybe each redirect; we could look into that)
> we could generate a phonetic index[1] and store those in a special
> Elasticsearch index. (We could look at storing multiple phonetic indexes
> for better recall, possibly generated by multiple algorithms; some, like
> Double Metaphone, generate multiple indexes by themselves.)
>
> Then, under certain circumstances (say, zero results and no suggestion
> from any other source, or no result with a score above a certain cutoff, or
> too few results, etc.), we could make a suggestion and/or show results
> based on matching phonetic index plus some score (say, a mix of page views
> and page rank, or whatever scoring we've got going on).
>
> So, when some doofus (hey, that's me!) comes along and searches for
> "borishnakoff" (worse than what I actually searched for), we could correct
> to *baryshnikov* (there's a page with that title) or give *Mikhail
> Baryshnikov* as a result (likely the top scoring item with the same
> phonetic index in the title), or something similar.
>
> Other algorithms exist (and can be devised) for languages other than
> English, so the maximally fleshed out version of this would offer a choice
> of phonetic indexing algorithms, but I get ahead of myself.
>
> *Has anyone looked into this kind of phonetic indexing for enwiki,
> Wikipedia in general, or other wikimedia projects before?*
>
> I have some additional thoughts on how to test the effectiveness of
> phonetic indexing on zero results for enwiki without having to fully
> implement everything if the index sounds like something we could afford to
> build.
>
> Thoughts?
>
> -Trey
>
> [1] https://en.wikipedia.org/wiki/Phonetic_algorithm - Briefly, as an
> example, you drop non-initial vowels and duplicate letters, and collapse
> letters that tend to sound alike, while taking into account orthographic
> conventions like sh, ch, th, initial kn- or pt-, etc. So both
> *baryshnikov* and *borishnakoff* are likely to come out something like
> BRXNGV.
>
> Trey Jones
> Software Engineer, Discovery
> Wikimedia Foundation
>
> _______________________________________________
> discovery mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/discovery
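To make the footnote and the fallback idea concrete, here is a rough Python sketch. It is a toy, not Soundex, Metaphone, or anything that exists in our search stack: the function names, the letter groupings, and the page-view numbers are all made up for illustration, and the exact keys it produces will differ from what a real algorithm gives. It follows the steps in the footnote (handle initial kn-/pt-, handle digraphs like sh/ch/th, drop non-initial vowels, collapse sound-alike and duplicate letters), indexes each title word by its key, and only answers queries of 4 or fewer tokens. The caller is assumed to invoke it only when the normal search came back empty.

```python
import re
from collections import defaultdict

# Hypothetical digraph codes, applied before single letters are handled.
DIGRAPHS = [("sh", "X"), ("ch", "X"), ("th", "0"), ("ph", "F"), ("ck", "K")]

# Hypothetical groups of letters that tend to sound alike.
SOUND_ALIKE = str.maketrans({
    "v": "F", "f": "F",
    "c": "K", "g": "K", "k": "K", "q": "K",
    "z": "S", "s": "S",
    "d": "T", "t": "T",
    "m": "N", "n": "N",
    "x": "KS",
})


def toy_phonetic_key(word):
    """Return a crude phonetic key for one word (illustrative only)."""
    w = re.sub(r"[^a-z]", "", word.lower())
    if not w:
        return ""
    # Initial kn- and pt-: keep only the second letter.
    w = re.sub(r"^(kn|pt)", lambda m: m.group(1)[1], w)
    for digraph, code in DIGRAPHS:
        w = w.replace(digraph, code)
    # Drop non-initial vowels (and the vowel-like y, h, w).
    w = w[0] + re.sub(r"[aeiouyhw]", "", w[1:])
    # Collapse sound-alike letters, then runs of the same code.
    key = w.translate(SOUND_ALIKE).upper()
    return re.sub(r"(.)\1+", r"\1", key)


def build_index(titles):
    """Map each phonetic key to the set of titles containing a matching word."""
    index = defaultdict(set)
    for title in titles:
        for token in title.split():
            key = toy_phonetic_key(token)
            if key:
                index[key].add(title)
    return index


def phonetic_suggestions(query, index, scores, max_tokens=4):
    """Titles whose words match every query token phonetically, best score first.

    Meant as a fallback: call it only when the regular search found nothing.
    """
    tokens = query.split()
    if not tokens or len(tokens) > max_tokens:
        return []
    candidates = None
    for token in tokens:
        matches = index.get(toy_phonetic_key(token), set())
        candidates = matches if candidates is None else candidates & matches
    return sorted(candidates, key=lambda t: scores.get(t, 0), reverse=True)


# Made-up scores standing in for page views or whatever ranking is available.
scores = {"Mikhail Baryshnikov": 100, "Baryshnikov": 10, "Boris Karloff": 50}
index = build_index(scores)
suggestions = phonetic_suggestions("borishnakoff", index, scores)
```

With these made-up scores, "borishnakoff" and "baryshnikov" reduce to the same key, so both Baryshnikov titles come back with the higher-scored one first, and "Boris Karloff" does not match. A real version would presumably use an established algorithm (and possibly several keys per title, as Double Metaphone produces) rather than this hand-rolled one.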
