Re: [discovery] Phonetic indexing

Deborah Tankersley Thu, 14 Jan 2016 15:08:24 -0800

Fuzzy matching
<https://en.wikipedia.org/wiki/Fuzzy_matching_(computer-assisted_translation)>
FTW? ;)


On Thu, Jan 14, 2016 at 2:51 PM, Trey Jones <[email protected]> wrote:

> There are lots of possible implementations of phonetic searching. Limiting
> based on query term count would save lots of overhead, and limiting it to
> terms that aren't in the index (or have very very low counts) could work,
> too. These are things we could test beforehand, to see what the expense and
> benefit of computing various things work out to be.
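
A rough sketch of that gating idea, in Python. The function name, the `term_counts` mapping (term to document count), and the `max_count` threshold are all hypothetical; the 4-term cap is the number Kevin suggested, not a tested value.

```python
def should_try_phonetic(query, term_counts, max_terms=4, max_count=2):
    """Decide whether a phonetic lookup looks worth its cost for this query.

    term_counts is a hypothetical mapping of indexed term -> number of
    documents containing it; a real system would ask the search index.
    """
    terms = query.lower().split()
    # Skip empty queries and long queries, where the overhead adds up.
    if not terms or len(terms) > max_terms:
        return False
    # Only bother when at least one term is missing or very rare.
    return any(term_counts.get(term, 0) <= max_count for term in terms)


counts = {"mikhail": 1200, "baryshnikov": 40}  # made-up counts for the demo
print(should_try_phonetic("mikhail borishnakoff", counts))  # True
print(should_try_phonetic("mikhail baryshnikov", counts))   # False
```

Both thresholds are exactly the kind of thing that could be tuned offline against logged zero-result queries.
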
>
> Soundex *is* pretty old, but it works okay. It's easily modified to be a
> bit smarter, too. The baseline implementation only considers the first few
> consonants to maximize recall for genealogists who are willing to sort
> through lots of hay to find that needle. Double Metaphone seems to be out
> there and available (may require a consultation with a lawyer), while
> Metaphone 3 is clearly for sale (the license is pretty nice as long as you
> don't want to share it).
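
For reference, a minimal sketch of classic American Soundex in Python. This is my own transcription of the commonly described rules, not code from any library in our stack. It shows where the high recall comes from: only the first letter and the first three consonant codes survive.

```python
# Consonant classes used by American Soundex; vowels, Y, H and W get no code.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name):
    """Return the 4-character Soundex code for an ASCII name."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    prev = CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        prev = code  # a vowel resets prev, so a repeated code counts again
    # Keep the first letter plus three digits, padding with zeros.
    return (first + "".join(digits) + "000")[:4]


print(soundex("Robert"))        # R163
print(soundex("baryshnikov"))   # B625
print(soundex("borishnakoff"))  # B625
```

Making it "a bit smarter" could be as simple as keeping more than three digits, at the cost of some recall.
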
>
> As for using it with other languages, hmmm, I have to think. The phonetic
> "index" generally would not be directly searchable in normal text; it
> isn't a phonetic representation of the word, it's just a code that similar
> sounding words tend to have.
>
> Phonetic spelling comes in a few varieties on enwiki. There are IPA
> spellings[1] and dictionary style phonetic spellings. The dictionary
> spellings can have different conventions (I don't know how well
> standardized they are on enwiki—linguists have been pushing for IPA since
> it is standardized). But even IPA can have differences of detail that make
> it unsearchable. Gorbachev has three IPA pronunciations: /ˈɡɔrbəˌtʃɔːf,
> -ˌtʃɒf/ in English, and ɡərbɐˈtɕɵf in Russian. The first one includes
> primary and secondary stress information, the second one is only the last
> syllable of the name, and the third one has primary stress info. Leave
> any of the stress info out, or try to search for the second pronunciation,
> and you don't get a match. So, I don't think we can leverage the phonetic
> spellings that are in articles.
>
> However, it would definitely work for reasonable spellings of many words
> of non-English origin. Possibly *aparrachick* for *apparatchik*, probably
> *shadenfroid* for *schadenfreude,* but probably not *paree* for *Paris*
> (there's already a redirect for that, though!). It depends a lot on the
> spelling system of the source language (French has too many silent letters,
> for example) or the transliteration system used, and the history of the
> borrowing (when spelling and sound don't match up, English tends to keep
> one and adapt the other, which is good, but sometimes it turns weird).
>
> [1] https://en.wikipedia.org/wiki/International_Phonetic_Alphabet —
> favored by linguists, woo hoo!
>
> —Trey
>
> Trey Jones
> Software Engineer, Discovery
> Wikimedia Foundation
>
> On Thu, Jan 14, 2016 at 2:11 PM, Deborah Tankersley <
> [email protected]> wrote:
>
>> I was thinking about something like that earlier this week - when I was
>> hearing about searching for a term in a different language (other than
>> English) on the en.wikipedia site and not getting any results. Could the
>> phonetic 'search' be used for that too? Do we have any idea of how many
>> pages (in English and otherwise) that have the phonetic spelling for the
>> main topic?
>>
>> Just some additional thoughts....
>>
>> Deb
>>
>>
>> On Thu, Jan 14, 2016 at 2:00 PM, Kevin Smith <[email protected]>
>> wrote:
>>
>>> Cool idea. I would also be inclined to limit it to searches containing 4
>>> or fewer words/tokens.
>>>
>>> My only experience is with soundex, which was invented in 1918, so I'm
>>> probably not the one to ask. :P
>>>
>>>
>>>
>>> Kevin Smith
>>> Agile Coach, Wikimedia Foundation
>>>
>>>
>>> On Thu, Jan 14, 2016 at 1:53 PM, Trey Jones <[email protected]>
>>> wrote:
>>>
>>>> For some reason today I wanted to look up Mikhail Baryshnikov. It's
>>>> been a while so I forgot how to spell his last name. I didn't try very
>>>> hard, and I got no enwiki result. Google, of course, found the correct
>>>> spelling, which I then used on enwiki.
>>>>
>>>> Since I used to do name searching and matching, this gave me an idea,
>>>> which generalizes beyond just names.
>>>>
>>>> For every article title (and maybe each redirect—we could look into
>>>> that) we could generate a phonetic index[1] and store those in a special
>>>> Elasticsearch index. (We could look at storing multiple phonetic indexes
>>>> for better recall, possibly generated by multiple algorithms; some, like
>>>> Double Metaphone, generate multiple indexes by themselves.)
>>>>
>>>> Then, under certain circumstances (say, zero results and no suggestion
>>>> from any other source, or no result with a score above a certain cutoff, or
>>>> too few results, etc.), we could make a suggestion and/or show results
>>>> based on matching phonetic index plus some score (say, a mix of page views
>>>> and page rank, or whatever scoring we've got going on).
>>>>
>>>> So, when some doofus (hey, that's me!) comes along and searches for
>>>> "borishnakoff" (worse than what I actually searched for), we could correct
>>>> to *baryshnikov* (there's a page with that title) or give *Mikhail
>>>> Baryshnikov* as a result (likely the top scoring item with the same
>>>> phonetic index in the title), or something similar.
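
The proposal above can be sketched in a few lines of Python, assuming an in-memory dict in place of a real search index. Everything named here is hypothetical: `toy_key` is a deliberately crude stand-in for a real phonetic algorithm, and the page-view numbers are made up for the demo.

```python
from collections import defaultdict


def toy_key(word):
    """Crude stand-in key: drop non-initial vowels, merge f/v, drop repeats."""
    out = []
    for ch in word.lower():
        if not ch.isalpha():
            continue
        if out and ch in "aeiouy":
            continue  # keep a vowel only when it is the first letter
        ch = "v" if ch == "f" else ch
        if not out or out[-1] != ch:
            out.append(ch)
    return "".join(out)


def build_index(titles, key_fn):
    """Map the phonetic key of each title word to the titles containing it."""
    index = defaultdict(set)
    for title in titles:
        for word in title.split():
            index[key_fn(word)].add(title)
    return index


def phonetic_fallback(query, normal_results, index, page_views, key_fn,
                      min_results=1):
    """Return normal results if there are enough, else phonetic matches."""
    if len(normal_results) >= min_results:
        return normal_results
    candidate_sets = [index.get(key_fn(w), set()) for w in query.split()]
    if not candidate_sets:
        return []
    # Require every query word to match phonetically, then rank by views.
    matches = set.intersection(*candidate_sets)
    return sorted(matches, key=lambda t: page_views.get(t, 0), reverse=True)


titles = ["Baryshnikov", "Mikhail Baryshnikov", "Boris Yeltsin"]
views = {"Mikhail Baryshnikov": 5000, "Baryshnikov": 300}  # made-up numbers
index = build_index(titles, toy_key)
print(phonetic_fallback("borishnakoff", [], index, views, toy_key))
```

Passing the key function in as a parameter leaves room for trying several algorithms against the same titles.
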
>>>>
>>>> Other algorithms exist (and can be devised) for languages other than
>>>> English, so the maximally fleshed out version of this would offer a choice
>>>> of phonetic indexing algorithms, but I get ahead of myself.
>>>>
>>>> *Has anyone looked into this kind of phonetic indexing for enwiki,
>>>> Wikipedia in general, or other Wikimedia projects before?*
>>>>
>>>> I have some additional thoughts on how to test the effectiveness of
>>>> phonetic indexing on zero results for enwiki without having to fully
>>>> implement everything if the index sounds like something we could afford to
>>>> build.
>>>>
>>>> Thoughts?
>>>>
>>>> —Trey
>>>>
>>>> [1] https://en.wikipedia.org/wiki/Phonetic_algorithm — Briefly, as an
>>>> example, you drop non-initial vowels and duplicate letters, and collapse
>>>> letters that tend to sound alike, while taking into account orthographic
>>>> conventions like sh, ch, th, initial kn- or pt-, etc. So both
>>>> *baryshnikov* and *borishnakoff* are likely to come out something like
>>>> BRXNGV.
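
A toy Python version of the footnote's recipe, to make it concrete. The letter classes and digraph codes below are guesses picked to be consistent with the BRXNGV example; they are not taken from Metaphone or any other published algorithm.

```python
import re

VOWELS = set("aeiouy")
# Hypothetical codes for common English digraphs.
DIGRAPHS = {"sh": "X", "ch": "X", "th": "T"}
# Hypothetical classes of letters that tend to sound alike.
CLASSES = {"b": "B", "p": "B", "f": "V", "v": "V",
           "c": "G", "g": "G", "k": "G", "q": "G",
           "d": "T", "t": "T", "s": "S", "z": "S"}


def sketch_key(word):
    """Build a rough phonetic key following the footnote's description."""
    w = re.sub(r"[^a-z]", "", word.lower())
    if w.startswith(("kn", "pt")):
        w = w[1:]  # initial kn- and pt- lose their silent first letter
    out = []
    i = 0
    while i < len(w):
        pair = w[i:i + 2]
        if pair in DIGRAPHS:
            code = DIGRAPHS[pair]
            i += 2
        else:
            ch = w[i]
            i += 1
            if ch in VOWELS:
                if out:
                    continue  # drop non-initial vowels
                code = ch.upper()
            else:
                code = CLASSES.get(ch, ch.upper())
        if not out or out[-1] != code:  # drop duplicate codes
            out.append(code)
    return "".join(out)


print(sketch_key("baryshnikov"))   # BRXNGV
print(sketch_key("borishnakoff"))  # BRXNGV
```

Unlike Soundex, this keeps every consonant class, so it trades some recall for precision.
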
>>>>
>>>> Trey Jones
>>>> Software Engineer, Discovery
>>>> Wikimedia Foundation
>>>>
>>>> _______________________________________________
>>>> discovery mailing list
>>>> [email protected]
>>>> https://lists.wikimedia.org/mailman/listinfo/discovery
>>>>
>>>>
>>>
>>
>>
>> --
>> --
>> Deb Tankersley
>> Product Manager, Discovery
>> Wikimedia Foundation
>>
>


-- 
-- 
Deb Tankersley
Product Manager, Discovery
Wikimedia Foundation

