Re: [discovery] Phonetic indexing

Deborah Tankersley Thu, 14 Jan 2016 14:12:07 -0800

I was thinking about something like that earlier this week, when I
heard about searches for a term in a language other than English on
the en.wikipedia site returning no results. Could the phonetic
'search' be used for that too? Do we have any idea how many pages
(in English and otherwise) have a phonetic spelling for the main
topic?


Just some additional thoughts....

Deb


On Thu, Jan 14, 2016 at 2:00 PM, Kevin Smith <[email protected]> wrote:

> Cool idea. I would also be inclined to limit it to searches containing 4
> or fewer words/tokens.
>
> My only experience is with Soundex, which was invented in 1918, so I'm
> probably not the one to ask. :P
>
>
>
> Kevin Smith
> Agile Coach, Wikimedia Foundation
>
>
> On Thu, Jan 14, 2016 at 1:53 PM, Trey Jones <[email protected]> wrote:
>
>> For some reason today I wanted to look up Mikhail Baryshnikov. It's been
>> a while so I forgot how to spell his last name. I didn't try very hard, and
>> I got no enwiki result. Google, of course, found the correct spelling,
>> which I then used on enwiki.
>>
>> Since I used to do name searching and matching, this gave me an idea,
>> which generalizes beyond just names.
>>
>> For every article title (and maybe each redirect; we could look into that)
>> we could generate a phonetic index[1] and store those in a special
>> Elasticsearch index. (We could look at storing multiple phonetic indexes
>> for better recall, possibly generated by multiple algorithms; some, like
>> Double Metaphone, generate multiple indexes by themselves.)
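
A minimal sketch of the lookup structure described above, assuming an in-memory dictionary stands in for the Elasticsearch index and that each word of a title is indexed separately. The two encoders are toy stand-ins invented for illustration (they are not Double Metaphone or any other published algorithm); the point is only to show several keys per title feeding one index.

```python
import re
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Set

Encoder = Callable[[str], str]


def consonant_skeleton(text: str) -> str:
    """Toy encoder (hypothetical): uppercase letters with vowels and Y removed."""
    return re.sub(r"[^A-Z]|[AEIOUY]", "", text.upper())


def merged_skeleton(text: str) -> str:
    """Toy encoder (hypothetical): as above, but F merges into V and repeats collapse."""
    key = consonant_skeleton(text).replace("F", "V")
    return re.sub(r"(.)\1+", r"\1", key)


def build_index(titles: Iterable[str], encoders: List[Encoder]) -> Dict[str, Set[str]]:
    """Map every key produced by every encoder, for every title word, to its titles."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for title in titles:
        for word in title.split():
            for encode in encoders:
                key = encode(word)
                if key:
                    index[key].add(title)
    return index


def lookup(query: str, index: Dict[str, Set[str]], encoders: List[Encoder]) -> Set[str]:
    """Return every title sharing at least one key with any word of the query."""
    matches: Set[str] = set()
    for word in query.split():
        for encode in encoders:
            matches |= index.get(encode(word), set())
    return matches


TITLES = ["Baryshnikov", "Mikhail Baryshnikov", "Ballet"]
ENCODERS = [consonant_skeleton, merged_skeleton]
index = build_index(TITLES, ENCODERS)
print(lookup("borishnakoff", index, ENCODERS))
```

With only the first encoder the misspelling finds nothing, because its key keeps the doubled F; adding the second encoder is what produces the match, which is the recall argument for storing more than one key per title.
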
>>
>> Then, under certain circumstances (say, zero results and no suggestion
>> from any other source, or no result with a score above a certain cutoff, or
>> too few results, etc.), we could make a suggestion and/or show results
>> based on matching phonetic index plus some score (say, a mix of page views
>> and page rank, or whatever scoring we've got going on).
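
The trigger conditions listed above could be sketched roughly as follows. The `SearchOutcome` type, the function name, and both threshold defaults are assumptions made up for this example; real cutoffs would have to come from looking at actual query data.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SearchOutcome:
    """Hypothetical summary of what the regular search returned."""
    scores: List[float]        # relevance scores of the regular results
    suggestion: Optional[str]  # "did you mean" text from any other source


def should_try_phonetic(outcome: SearchOutcome,
                        min_score: float = 1.0,
                        min_results: int = 3) -> bool:
    """Decide whether to fall back to the phonetic index.

    Thresholds are placeholder values, not tuned numbers.
    """
    if not outcome.scores:
        # Zero results: only fall back if nothing else offered a suggestion.
        return outcome.suggestion is None
    if max(outcome.scores) < min_score:
        # No result scores above the cutoff.
        return True
    # Too few results.
    return len(outcome.scores) < min_results


print(should_try_phonetic(SearchOutcome(scores=[], suggestion=None)))
```
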
>>
>> So, when some doofus (hey, that's me!) comes along and searches for
>> "borishnakoff" (worse than what I actually searched for), we could correct
>> to *baryshnikov* (there's a page with that title) or give *Mikhail
>> Baryshnikov* as a result (likely the top scoring item with the same
>> phonetic index in the title), or something similar.
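
Picking the result to show could look something like this, assuming the phonetic lookup has already produced a key for the query. The index contents, the key, and the scores are made-up illustrative values, not real page view or page rank figures.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical phonetic index: key -> (title, score) pairs.
# The key and the scores are invented for illustration only.
PHONETIC_INDEX: Dict[str, List[Tuple[str, float]]] = {
    "BRXNGV": [("Baryshnikov", 120.0), ("Mikhail Baryshnikov", 950.0)],
}


def best_title(query_key: str,
               index: Dict[str, List[Tuple[str, float]]]) -> Optional[str]:
    """Return the highest-scoring title sharing the query's phonetic key, if any."""
    candidates = index.get(query_key, [])
    if not candidates:
        return None
    title, _score = max(candidates, key=lambda pair: pair[1])
    return title


print(best_title("BRXNGV", PHONETIC_INDEX))
```

Whether to present that title as a spelling correction or as a direct result is a separate product decision; the sketch only covers the ranking step.
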
>>
>> Other algorithms exist (and can be devised) for languages other than
>> English, so the maximally fleshed out version of this would offer a choice
>> of phonetic indexing algorithms, but I get ahead of myself.
>>
>> *Has anyone looked into this kind of phonetic indexing for enwiki,
>> Wikipedia in general, or other Wikimedia projects before?*
>>
>> I have some additional thoughts on how to test the effectiveness of
>> phonetic indexing on zero results for enwiki without having to fully
>> implement everything if the index sounds like something we could afford to
>> build.
>>
>> Thoughts?
>>
>> —Trey
>>
>> [1] https://en.wikipedia.org/wiki/Phonetic_algorithm — Briefly, as an
>> example, you drop non-initial vowels and duplicate letters, and collapse
>> letters that tend to sound alike, while taking into account orthographic
>> conventions like sh, ch, th, initial kn- or pt-, etc. So both
>> *baryshnikov* and *borishnakoff* are likely to come out something like
>> BRXNGV.
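
The rules in the footnote can be turned into a very rough key function. This is a simplified sketch written for illustration, not Soundex, Metaphone, or any other published algorithm; the sound classes and the choice to treat Y, H, and W like vowels are assumptions picked so the two spellings in the example agree.

```python
import re

# Letters treated as vowel-like and dropped when not word-initial (assumption).
VOWEL_LIKE = set("aeiouyhw")

# Hypothetical sound classes: letters that tend to sound alike share one code.
SOUND_CLASSES = {
    "b": "B", "p": "B",
    "f": "V", "v": "V",
    "c": "G", "g": "G", "k": "G", "q": "G",
    "d": "T", "t": "T",
    "s": "S", "z": "S", "x": "S",
    "j": "J", "l": "L", "m": "M", "n": "N", "r": "R",
}


def phonetic_key(text: str) -> str:
    """Build a crude phonetic key from a single word."""
    word = re.sub(r"[^a-z]", "", text.lower())
    # Initial kn- and pt- lose their silent first letter.
    word = re.sub(r"^(kn|pt)", lambda m: m.group(1)[1], word)
    # Orthographic conventions: sh and ch share the marker X; th and ph simplify.
    word = (word.replace("sh", "X").replace("ch", "X")
                .replace("th", "t").replace("ph", "f"))
    key = []
    previous = ""
    for position, letter in enumerate(word):
        if letter == "X":
            code = "X"
        elif letter in VOWEL_LIKE:
            # Keep a vowel-like letter only at the start of the word.
            code = letter.upper() if position == 0 else ""
        else:
            code = SOUND_CLASSES.get(letter, "")
        # Skip a code that repeats the one immediately before it.
        if code and code != previous:
            key.append(code)
        previous = code
    return "".join(key)


print(phonetic_key("baryshnikov"), phonetic_key("borishnakoff"))
```

Under these made-up classes both spellings reduce to the same key, and an initial kn- word such as "knight" gets the same key as "night".
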
>>
>> Trey Jones
>> Software Engineer, Discovery
>> Wikimedia Foundation
>>
>> _______________________________________________
>> discovery mailing list
>> [email protected]
>> https://lists.wikimedia.org/mailman/listinfo/discovery
>>
>>


-- 
Deb Tankersley
Product Manager, Discovery
Wikimedia Foundation

