Re: [discovery] Phonetic indexing

Oliver Keyes Thu, 14 Jan 2016 14:26:26 -0800

I really love this idea!
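
For what it's worth, here is a rough Python sketch of the recipe in Trey's
footnote [1] and the fallback lookup he describes, both quoted below. The
rules in phonetic_key are a toy simplification, not Soundex or Double
Metaphone. Every name here (phonetic_key, build_index, suggest,
SCORED_TITLES) and every score is made up for illustration. A real version
would presumably use an Elasticsearch index rather than a Python dict.

```python
import re
from collections import defaultdict

# Toy rules, assumed for illustration only; real algorithms are far more careful.
DIGRAPHS = [("sch", "X"), ("sh", "X"), ("ch", "X"), ("th", "0"), ("ph", "f")]
SOUND_ALIKE = str.maketrans({"v": "f", "z": "s", "c": "k", "q": "k", "x": "ks"})


def phonetic_key(word):
    """Return a crude phonetic key for a single word."""
    w = re.sub(r"[^a-z]", "", word.lower())
    if not w:
        return ""
    w = re.sub(r"^[kgp]n", "n", w)   # initial kn-, gn-, pn-: silent first letter
    w = re.sub(r"^pt", "t", w)       # initial pt-: silent p
    for spelling, symbol in DIGRAPHS:
        w = w.replace(spelling, symbol)
    w = w.translate(SOUND_ALIKE)     # collapse letters that tend to sound alike
    w = re.sub(r"(.)\1+", r"\1", w)  # collapse runs of a repeated letter
    # Keep the first character, then drop vowels (and y, h, w) everywhere else.
    return (w[0] + re.sub(r"[aeiouyhw]", "", w[1:])).upper()


def build_index(scored_titles):
    """Map each phonetic key to the (score, title) pairs whose title contains it."""
    index = defaultdict(set)
    for title, score in scored_titles.items():
        for token in title.split():
            key = phonetic_key(token)
            if key:
                index[key].add((score, title))
    return index


def suggest(query, index, max_tokens=4, limit=3):
    """Fallback lookup: titles sharing a phonetic key with the query, best score first."""
    tokens = query.split()
    if not tokens or len(tokens) > max_tokens:  # Kevin's 4-token limit
        return []
    hits = set()
    for token in tokens:
        hits |= index.get(phonetic_key(token), set())
    ranked = sorted(hits, key=lambda hit: (-hit[0], hit[1]))
    return [title for _, title in ranked[:limit]]


# Placeholder scores standing in for page views, page rank, or similar.
SCORED_TITLES = {"Mikhail Baryshnikov": 900, "Baryshnikov": 100, "Boris Johnson": 500}

index = build_index(SCORED_TITLES)
print(suggest("borishnakoff", index))  # ['Mikhail Baryshnikov', 'Baryshnikov']
```

With these toy rules both baryshnikov and borishnakoff reduce to BRXNKF, so
the misspelling finds both titles and the higher-scoring one comes first.
suggest is only meant to run in the fallback cases Trey lists (zero results,
no other suggestion, and so on).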

On 14 January 2016 at 14:11, Deborah Tankersley
<[email protected]> wrote:
> I was thinking about something like that earlier this week, when I heard
> about people searching en.wikipedia for a term in a language other than
> English and not getting any results. Could the phonetic 'search' be used
> for that too? Do we have any idea how many pages (in English and
> otherwise) have the phonetic spelling for the main topic?
>
> Just some additional thoughts....
>
> Deb
>
>
> On Thu, Jan 14, 2016 at 2:00 PM, Kevin Smith <[email protected]> wrote:
>>
>> Cool idea. I would also be inclined to limit it to searches containing 4
>> or fewer words/tokens.
>>
>> My only experience is with Soundex, which was patented in 1918, so I'm
>> probably not the one to ask. :P
>>
>>
>>
>> Kevin Smith
>> Agile Coach, Wikimedia Foundation
>>
>>
>> On Thu, Jan 14, 2016 at 1:53 PM, Trey Jones <[email protected]> wrote:
>>>
>>> For some reason today I wanted to look up Mikhail Baryshnikov. It's been
>>> a while so I forgot how to spell his last name. I didn't try very hard, and
>>> I got no enwiki result. Google, of course, found the correct spelling, which
>>> I then used on enwiki.
>>>
>>> Since I used to do name searching and matching, this gave me an idea,
>>> which generalizes beyond just names.
>>>
>>> For every article title (and maybe each redirect; we could look into that)
>>> we could generate a phonetic index[1] and store those in a special
>>> Elasticsearch index. (We could look at storing multiple phonetic indexes
>>> for better recall, possibly generated by multiple algorithms; some, like
>>> Double Metaphone, generate multiple indexes by themselves.)
>>>
>>> Then, under certain circumstances (say, zero results and no suggestion
>>> from any other source, or no result with a score above a certain cutoff, or
>>> too few results, etc.), we could make a suggestion and/or show results based
>>> on matching phonetic index plus some score (say, a mix of page views and
>>> page rank, or whatever scoring we've got going on).
>>>
>>> So, when some doofus (hey, that's me!) comes along and searches for
>>> "borishnakoff" (worse than what I actually searched for), we could correct
>>> to baryshnikov (there's a page with that title) or give Mikhail Baryshnikov as
>>> a result (likely the top scoring item with the same phonetic index in the
>>> title), or something similar.
>>>
>>> Other algorithms exist (and can be devised) for languages other than
>>> English, so the maximally fleshed out version of this would offer a choice
>>> of phonetic indexing algorithms, but I get ahead of myself.
>>>
>>> Has anyone looked into this kind of phonetic indexing for enwiki,
>>> Wikipedia in general, or other Wikimedia projects before?
>>>
>>> I have some additional thoughts on how to test the effectiveness of
>>> phonetic indexing on zero results for enwiki without having to fully
>>> implement everything if the index sounds like something we could afford to
>>> build.
>>>
>>> Thoughts?
>>>
>>> —Trey
>>>
>>> [1] https://en.wikipedia.org/wiki/Phonetic_algorithm — Briefly, as an
>>> example, you drop non-initial vowels and duplicate letters, and collapse
>>> letters that tend to sound alike, while taking into account orthographic
>>> conventions like sh, ch, th, initial kn- or pt-, etc. So both baryshnikov
>>> and borishnakoff are likely to come out something like BRXNGV.
>>>
>>> Trey Jones
>>> Software Engineer, Discovery
>>> Wikimedia Foundation
>>>
>>> _______________________________________________
>>> discovery mailing list
>>> [email protected]
>>> https://lists.wikimedia.org/mailman/listinfo/discovery
>>>
>>
>>
>>
>
>
>
> --
> Deb Tankersley
> Product Manager, Discovery
> Wikimedia Foundation
>
>




-- 
Oliver Keyes
Count Logula
Wikimedia Foundation

