Cool idea. I would also be inclined to limit it to searches containing 4 or
fewer words/tokens.

My only experience is with Soundex, which was invented in 1918, so I'm
probably not the one to ask. :P
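
For anyone who hasn't seen it, here is a rough Python sketch of classic
American Soundex together with the token-count gate suggested above. The
function names (soundex, is_eligible) and the max_tokens parameter are made
up for illustration, and nothing here reflects how CirrusSearch or
Elasticsearch would actually do it.

```python
# Letter-to-digit table for classic American Soundex.
SOUNDEX_CODES = {
    **dict.fromkeys("bfpv", "1"),
    **dict.fromkeys("cgjkqsxz", "2"),
    **dict.fromkeys("dt", "3"),
    "l": "4",
    **dict.fromkeys("mn", "5"),
    "r": "6",
}


def soundex(word):
    """Return the 4-character Soundex code for a single word."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    code = first.upper()
    prev = SOUNDEX_CODES.get(first, "")
    for c in letters[1:]:
        digit = SOUNDEX_CODES.get(c, "")
        # Skip uncoded letters and repeats of the previous code.
        if digit and digit != prev:
            code += digit
        # 'h' and 'w' do not separate equal codes; vowels do.
        if c not in "hw":
            prev = digit
    # Pad with zeros, then truncate to one letter plus three digits.
    return (code + "000")[:4]


def is_eligible(query, max_tokens=4):
    """Hypothetical gate: only try phonetic matching on short queries."""
    return 0 < len(query.split()) <= max_tokens
```

With this sketch, soundex("baryshnikov") and soundex("borishnakoff") both
come out as B625, which is the kind of collision the proposal relies on.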



Kevin Smith
Agile Coach, Wikimedia Foundation


On Thu, Jan 14, 2016 at 1:53 PM, Trey Jones <[email protected]> wrote:

> For some reason today I wanted to look up Mikhail Baryshnikov. It's been a
> while so I forgot how to spell his last name. I didn't try very hard, and I
> got no enwiki result. Google, of course, found the correct spelling, which
> I then used on enwiki.
>
> Since I used to do name searching and matching, this gave me an idea,
> which generalizes beyond just names.
>
> For every article title (and maybe each redirect—we could look into that)
> we could generate a phonetic index[1] and store those in a special
> Elasticsearch index. (We could look at storing multiple phonetic indexes
> for better recall, possibly generated by multiple algorithms; some, like
> Double Metaphone, generate multiple indexes by themselves.)
>
> Then, under certain circumstances (say, zero results and no suggestion
> from any other source, or no result with a score above a certain cutoff, or
> too few results, etc.), we could make a suggestion and/or show results
> based on matching phonetic index plus some score (say, a mix of page views
> and page rank, or whatever scoring we've got going on).
>
> So, when some doofus (hey, that's me!) comes along and searches for
> "borishnakoff" (worse than what I actually searched for), we could correct
> to *baryshnikov* (there's a page with that title) or give *Mikhail
> Baryshnikov* as a result (likely the top scoring item with the same
> phonetic index in the title), or something similar.
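
To make the flow above concrete, here is a toy Python sketch of the fallback:
key every title token phonetically, and only when the normal search returns
nothing, look up the query's keys and rank the hits by score. The key
function is a deliberately crude stand-in that follows the steps in footnote
[1] (it is not Metaphone or any real algorithm), and the names toy_key,
build_index and phonetic_fallback, along with the scores in the usage note,
are invented for illustration.

```python
import re
from collections import defaultdict

VOWELS = set("aeiouy")


def toy_key(word):
    """Crude phonetic key; a stand-in, not a real phonetic algorithm."""
    s = re.sub(r"[^a-z]", "", word.lower())
    if not s:
        return ""
    # Collapse a few digraphs and sound-alike letters.
    s = s.replace("sh", "x").replace("ch", "x").replace("ph", "f")
    s = s.replace("v", "f")
    # Drop non-initial vowels, then squeeze runs of the same letter.
    s = s[0] + "".join(c for c in s[1:] if c not in VOWELS)
    s = re.sub(r"(.)\1+", r"\1", s)
    return s.upper()


def build_index(scored_titles):
    """Map each token's key to the set of (score, title) pairs using it."""
    index = defaultdict(set)
    for title, score in scored_titles.items():
        for token in title.split():
            key = toy_key(token)
            if key:
                index[key].add((score, title))
    return index


def phonetic_fallback(query, primary_results, index, limit=5):
    """Suggest titles by phonetic key, but only on zero primary results."""
    if primary_results:
        return []
    hits = set()
    for token in query.split():
        hits |= index.get(toy_key(token), set())
    # Highest score first; ties broken alphabetically by title.
    ranked = sorted(hits, key=lambda hit: (-hit[0], hit[1]))
    return [title for _, title in ranked[:limit]]
```

Given an index built from {"Mikhail Baryshnikov": 900, "Baryshnikov": 120}
(made-up scores), a zero-result query for "borishnakoff" would return the
higher-scoring "Mikhail Baryshnikov" first. Other triggers mentioned above,
such as a score cutoff or too few results, would only change the first check
in phonetic_fallback.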
>
> Other algorithms exist (and can be devised) for languages other than
> English, so the maximally fleshed out version of this would offer a choice
> of phonetic indexing algorithms, but I get ahead of myself.
>
> *Has anyone looked into this kind of phonetic indexing for enwiki,
> Wikipedia in general, or other Wikimedia projects before?*
>
> I have some additional thoughts on how to test the effectiveness of
> phonetic indexing on zero results for enwiki without having to fully
> implement everything if the index sounds like something we could afford to
> build.
>
> Thoughts?
>
> —Trey
>
> [1] https://en.wikipedia.org/wiki/Phonetic_algorithm — Briefly, as an
> example, you drop non-initial vowels and duplicate letters, and collapse
> letters that tend to sound alike, while taking into account orthographic
> conventions like sh, ch, th, initial kn- or pt-, etc. So both
> *baryshnikov* and *borishnakoff* are likely to come out something like
> BRXNGV.
>
> Trey Jones
> Software Engineer, Discovery
> Wikimedia Foundation
>
> _______________________________________________
> discovery mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/discovery
>
>
