For some reason today I wanted to look up Mikhail Baryshnikov. It's been a
while so I forgot how to spell his last name. I didn't try very hard, and I
got no enwiki result. Google, of course, found the correct spelling, which
I then used on enwiki.

Since I used to do name searching and matching, this gave me an idea, which
generalizes beyond just names.

For every article title (and maybe each redirect, which we could look into)
we could generate a phonetic index[1] and store those in a special
Elasticsearch index. (We could look at storing multiple phonetic indexes
for better recall, possibly generated by multiple algorithms; some, like
Double Metaphone, generate multiple indexes by themselves.)
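
To make the shape of that special index concrete, here is a rough sketch of
what its definition might look like. It assumes the Elasticsearch
analysis-phonetic plugin is installed (it provides a "phonetic" token filter
with a "double_metaphone" encoder) and uses the typeless mapping style of
recent Elasticsearch versions. The filter, analyzer, and field names are all
made up for illustration, and nothing here reflects our current config.

```python
import json

# Hypothetical index definition: a "title" field with a phonetic sub-field.
# Assumes the analysis-phonetic plugin; all names below are illustrative.
PHONETIC_INDEX_BODY = {
    "settings": {
        "analysis": {
            "filter": {
                # Token filter that turns each title word into phonetic codes.
                "title_double_metaphone": {
                    "type": "phonetic",
                    "encoder": "double_metaphone",
                }
            },
            "analyzer": {
                "title_phonetic": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "title_double_metaphone"],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            "title": {
                "type": "text",
                "fields": {
                    # Queried only when the phonetic fallback kicks in.
                    "phonetic": {"type": "text", "analyzer": "title_phonetic"}
                },
            }
        }
    },
}

if __name__ == "__main__":
    # Print the JSON body that would be sent when creating the index.
    print(json.dumps(PHONETIC_INDEX_BODY, indent=2))
```

Storing keys from more than one algorithm would just mean more sub-fields,
each with its own analyzer.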

Then, under certain circumstances (say, zero results and no suggestion from
any other source, or no result with a score above a certain cutoff, or too
few results, etc.), we could make a suggestion and/or show results based on
matching phonetic index plus some score (say, a mix of page views and page
rank, or whatever scoring we've got going on).
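
A minimal sketch of that fallback logic, under some loud assumptions: the
thresholds are invented, the scores in the example are made up, and I've
simplified the trigger so that any suggestion from another source suppresses
the phonetic one. The in-memory dict stands in for the real phonetic index,
keyed by the phonetic key of each title word.

```python
from typing import Dict, List, Tuple

# Hypothetical thresholds; real values would need tuning against query logs.
MIN_RESULTS = 3
SCORE_CUTOFF = 1.0

# phonetic key -> list of (title, score); score is whatever mix of page
# views and page rank we settle on.
PhoneticIndex = Dict[str, List[Tuple[str, float]]]


def needs_phonetic_fallback(result_scores: List[float],
                            has_other_suggestion: bool) -> bool:
    """Decide whether the normal results are weak enough to try phonetics."""
    if has_other_suggestion:
        return False
    if len(result_scores) < MIN_RESULTS:  # also covers zero results
        return True
    return max(result_scores) < SCORE_CUTOFF


def phonetic_candidates(query_key: str, index: PhoneticIndex,
                        limit: int = 3) -> List[str]:
    """Return titles sharing the query's phonetic key, best score first."""
    matches = index.get(query_key, [])
    ranked = sorted(matches, key=lambda match: match[1], reverse=True)
    return [title for title, _score in ranked[:limit]]


# Toy index with made-up scores, for illustration only.
EXAMPLE_INDEX: PhoneticIndex = {
    "BRXNKF": [("Baryshnikov", 0.4), ("Mikhail Baryshnikov", 0.9)],
}

if __name__ == "__main__":
    if needs_phonetic_fallback([], has_other_suggestion=False):
        print(phonetic_candidates("BRXNKF", EXAMPLE_INDEX))
```

Whether the top candidate is shown as a "did you mean" suggestion or its
results are shown directly is a separate decision from the ranking.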

So, when some doofus (hey, that's me!) comes along and searches for
"borishnakoff" (worse than what I actually searched for), we could correct
to *baryshnikov* (there's a page with that title) or give *Mikhail
Baryshnikov* as a result (likely the top scoring item with the same
phonetic index in the title), or something similar.

Other algorithms exist (and can be devised) for languages other than
English, so the maximally fleshed out version of this would offer a choice
of phonetic indexing algorithms, but I get ahead of myself.

*Has anyone looked into this kind of phonetic indexing for enwiki,
Wikipedia in general, or other Wikimedia projects before?*

I have some additional thoughts on how to test the effectiveness of
phonetic indexing on zero results for enwiki without having to fully
implement everything, if the index sounds like something we could afford to
build.

Thoughts?

—Trey

[1] https://en.wikipedia.org/wiki/Phonetic_algorithm — Briefly, as an
example, you drop non-initial vowels and duplicate letters, and collapse
letters that tend to sound alike, while taking into account orthographic
conventions like sh, ch, th, initial kn- or pt-, etc. So both *baryshnikov*
and *borishnakoff* are likely to come out something like BRXNGV.
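
The steps in that footnote can be sketched as a toy encoder. This is not
Soundex, Metaphone, or any other published algorithm; the letter tables and
the function name are my own simplifications, chosen only to show the idea.

```python
import re

VOWELS = set("AEIOUY")

# Silent initial letters (kn-, pt-, ...), rewritten before anything else.
INITIAL_REWRITES = [("KN", "N"), ("PT", "T"), ("GN", "N"), ("WR", "R")]

# Multi-letter spellings collapsed to one symbol (X for "sh", 0 for "th").
DIGRAPHS = [("SCH", "X"), ("SH", "X"), ("CH", "X"),
            ("TH", "0"), ("PH", "F"), ("CK", "K")]

# Single letters that tend to sound alike.
SOUND_ALIKES = {"V": "F", "Z": "S", "C": "K", "Q": "K", "G": "K", "D": "T"}


def phonetic_key(word: str) -> str:
    """Return a crude phonetic key for one word (illustrative only)."""
    s = re.sub(r"[^A-Z]", "", word.upper())
    for prefix, replacement in INITIAL_REWRITES:
        if s.startswith(prefix):
            s = replacement + s[len(prefix):]
            break
    # Spell out a literal X so it cannot be confused with the "sh" symbol.
    s = s.replace("X", "KS")
    for spelling, symbol in DIGRAPHS:
        s = s.replace(spelling, symbol)

    out = []
    prev = ""  # previous symbol; reset by vowels so only adjacent repeats merge
    for i, ch in enumerate(s):
        if ch in VOWELS:
            if i == 0:
                out.append(ch)  # keep an initial vowel, drop the rest
            prev = ""
            continue
        symbol = SOUND_ALIKES.get(ch, ch)
        if symbol != prev:
            out.append(symbol)
        prev = symbol
    return "".join(out)


if __name__ == "__main__":
    for name in ("baryshnikov", "borishnakoff"):
        print(name, "->", phonetic_key(name))  # both print BRXNKF
```

Both spellings come out as BRXNKF here; the exact letters differ a little
from the example above only because of which sounds this toy table happens
to collapse.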

Trey Jones
Software Engineer, Discovery
Wikimedia Foundation
