Fuzzy matching <https://en.wikipedia.org/wiki/Fuzzy_matching_(computer-assisted_translation)> FTW? ;)
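
To make the comparison with the phonetic-index idea quoted below a bit more concrete, here is a rough Python sketch run on the two spellings from the thread. Everything in it is an assumption for illustration: `difflib.SequenceMatcher` from the standard library stands in for "fuzzy matching", and `toy_phonetic_key` is a hypothetical, simplified key built from the footnote's description (it is not Soundex or any Metaphone variant, and its sounds-alike table was picked only so the thread's example works).

```python
import difflib

VOWELS = "aeiouy"

# Hypothetical "letters that tend to sound alike" table (an assumption,
# not taken from any published phonetic algorithm).
SOUND_ALIKE = str.maketrans(
    {"f": "v", "k": "g", "c": "g", "q": "g", "p": "b", "z": "s", "d": "t"}
)

# Orthographic conventions mentioned in the footnote: sh, ch, th.
DIGRAPHS = (("sh", "x"), ("ch", "x"), ("th", "0"))


def toy_phonetic_key(word: str) -> str:
    """Return a crude phonetic key for a single word (illustrative only)."""
    w = word.lower()
    # Initial kn- or pt-: drop the first letter.
    if w.startswith(("kn", "pt")):
        w = w[1:]
    # Replace each digraph with a single code character.
    for digraph, code in DIGRAPHS:
        w = w.replace(digraph, code)
    # Collapse letters that tend to sound alike.
    w = w.translate(SOUND_ALIKE)
    # Drop adjacent duplicate letters.
    deduped = [ch for i, ch in enumerate(w) if i == 0 or ch != w[i - 1]]
    # Drop non-initial vowels.
    kept = [ch for i, ch in enumerate(deduped) if i == 0 or ch not in VOWELS]
    return "".join(kept).upper()


def edit_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1], used here as a fuzzy-match score."""
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()


if __name__ == "__main__":
    for spelling in ("baryshnikov", "borishnakoff"):
        print(spelling, toy_phonetic_key(spelling))
    print(edit_similarity("borishnakoff", "baryshnikov"))
```

With this toy table the two spellings collapse to the same key, while the fuzzy score depends on how many characters line up. Which approach works better on real zero-result queries would need the kind of testing discussed below.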

On Thu, Jan 14, 2016 at 2:51 PM, Trey Jones <[email protected]> wrote:
> There are lots of possible implementations of phonetic searching.
> Limiting based on query term count would save lots of overhead, and
> limiting it to terms that aren't in the index (or have very, very low
> counts) could work, too. These are things we could test beforehand, to
> see what the expense and benefit of computing various things work out
> to be.
>
> Soundex *is* pretty old, but it works okay. It's easily modified to be
> a bit smarter, too. The baseline implementation only considers the
> first few consonants to maximize recall for genealogists who are
> willing to sort through lots of hay to find that needle. Double
> Metaphone seems to be out there and available (may require a
> consultation with a lawyer), while Metaphone 3 is clearly for sale
> (the license is pretty nice as long as you don't want to share it).
>
> As for using it with other languages, hmmm, I have to think. The
> phonetic "index" generally would not be directly searchable in normal
> text; it isn't a phonetic representation of the word, it's just a code
> that similar-sounding words tend to have.
>
> Phonetic spelling comes in a few varieties on enwiki. There are IPA
> spellings[1] and dictionary-style phonetic spellings. The dictionary
> spellings can have different conventions (I don't know how well
> standardized they are on enwiki; linguists have been pushing for IPA
> since it is standardized). But even IPA can have differences of detail
> that make it unsearchable. Gorbachev has three IPA pronunciations:
> /ˈɡɔrbəˌtʃɔːf, -ˌtʃɒf/ in English, and ɡərbɐˈtɕɵf in Russian. The
> first one includes primary and secondary stress information, the
> second one is only the last syllable of the name, and the third one
> has primary stress info. Leave any of the stress info out, or try to
> search for the second pronunciation, and you don't get a match.
> So, I don't think we can leverage the phonetic spellings that are in
> articles.
>
> However, it would definitely work for reasonable spellings of many
> words of non-English origin. Possibly *aparrachick* for
> *apparatchik*, probably *shadenfroid* for *schadenfreude*, but
> probably not *paree* for *Paris* (there's already a redirect for
> that, though!). It depends a lot on the spelling system of the source
> language (French has too many silent letters, for example) or the
> transliteration system used, and the history of the borrowing (when
> spelling and sound don't match up, English tends to keep one and adapt
> the other, which is good, but sometimes it turns weird).
>
> [1] https://en.wikipedia.org/wiki/International_Phonetic_Alphabet
> (favored by linguists, woo hoo!)
>
> -Trey
>
> Trey Jones
> Software Engineer, Discovery
> Wikimedia Foundation
>
> On Thu, Jan 14, 2016 at 2:11 PM, Deborah Tankersley <[email protected]> wrote:
>
>> I was thinking about something like that earlier this week, when I
>> was hearing about searching for a term in a different language (other
>> than English) on the en.wikipedia site and not getting any results.
>> Could the phonetic 'search' be used for that too? Do we have any idea
>> how many pages (in English and otherwise) have the phonetic spelling
>> for the main topic?
>>
>> Just some additional thoughts....
>>
>> Deb
>>
>> On Thu, Jan 14, 2016 at 2:00 PM, Kevin Smith <[email protected]> wrote:
>>
>>> Cool idea. I would also be inclined to limit it to searches
>>> containing 4 or fewer words/tokens.
>>>
>>> My only experience is with Soundex, which was invented in 1918, so
>>> I'm probably not the one to ask. :P
>>>
>>> Kevin Smith
>>> Agile Coach, Wikimedia Foundation
>>>
>>> On Thu, Jan 14, 2016 at 1:53 PM, Trey Jones <[email protected]> wrote:
>>>
>>>> For some reason today I wanted to look up Mikhail Baryshnikov. It's
>>>> been a while so I forgot how to spell his last name.
>>>> I didn't try very hard, and I got no enwiki result. Google, of
>>>> course, found the correct spelling, which I then used on enwiki.
>>>>
>>>> Since I used to do name searching and matching, this gave me an
>>>> idea, which generalizes beyond just names.
>>>>
>>>> For every article title (and maybe each redirect; we could look
>>>> into that) we could generate a phonetic index[1] and store those in
>>>> a special Elasticsearch index. (We could look at storing multiple
>>>> phonetic indexes for better recall, possibly generated by multiple
>>>> algorithms; some, like Double Metaphone, generate multiple indexes
>>>> by themselves.)
>>>>
>>>> Then, under certain circumstances (say, zero results and no
>>>> suggestion from any other source, or no result with a score above a
>>>> certain cutoff, or too few results, etc.), we could make a
>>>> suggestion and/or show results based on matching phonetic index
>>>> plus some score (say, a mix of page views and page rank, or
>>>> whatever scoring we've got going on).
>>>>
>>>> So, when some doofus (hey, that's me!) comes along and searches for
>>>> "borishnakoff" (worse than what I actually searched for), we could
>>>> correct to *baryshnikov* (there's a page with that title) or give
>>>> *Mikhail Baryshnikov* as a result (likely the top-scoring item with
>>>> the same phonetic index in the title), or something similar.
>>>>
>>>> Other algorithms exist (and can be devised) for languages other
>>>> than English, so the maximally fleshed-out version of this would
>>>> offer a choice of phonetic indexing algorithms, but I get ahead of
>>>> myself.
>>>>
>>>> *Has anyone looked into this kind of phonetic indexing for enwiki,
>>>> Wikipedia in general, or other Wikimedia projects before?*
>>>>
>>>> I have some additional thoughts on how to test the effectiveness of
>>>> phonetic indexing on zero results for enwiki without having to
>>>> fully implement everything, if the index sounds like something we
>>>> could afford to build.
>>>>
>>>> Thoughts?
>>>>
>>>> -Trey
>>>>
>>>> [1] https://en.wikipedia.org/wiki/Phonetic_algorithm
>>>> Briefly, as an example, you drop non-initial vowels and duplicate
>>>> letters, and collapse letters that tend to sound alike, while
>>>> taking into account orthographic conventions like sh, ch, th,
>>>> initial kn- or pt-, etc. So both *baryshnikov* and *borishnakoff*
>>>> are likely to come out something like BRXNGV.
>>>>
>>>> Trey Jones
>>>> Software Engineer, Discovery
>>>> Wikimedia Foundation
>>
>> --
>> Deb Tankersley
>> Product Manager, Discovery
>> Wikimedia Foundation

--
Deb Tankersley
Product Manager, Discovery
Wikimedia Foundation
_______________________________________________ discovery mailing list [email protected] https://lists.wikimedia.org/mailman/listinfo/discovery
