I really love this idea!

On 14 January 2016 at 14:11, Deborah Tankersley <[email protected]> wrote:
> I was thinking about something like that earlier this week - when I was
> hearing about searching for a term in a different language (other than
> English) on the en.wikipedia site and not getting any results. Could the
> phonetic 'search' be used for that too? Do we have any idea of how many
> pages (in English and otherwise) have the phonetic spelling for the
> main topic?
>
> Just some additional thoughts....
>
> Deb
>
>
> On Thu, Jan 14, 2016 at 2:00 PM, Kevin Smith <[email protected]> wrote:
>>
>> Cool idea. I would also be inclined to limit it to searches containing 4
>> or fewer words/tokens.
>>
>> My only experience is with Soundex, which was invented in 1918, so I'm
>> probably not the one to ask. :P
>>
>>
>>
>> Kevin Smith
>> Agile Coach, Wikimedia Foundation
>>
>>
>> On Thu, Jan 14, 2016 at 1:53 PM, Trey Jones <[email protected]> wrote:
>>>
>>> For some reason today I wanted to look up Mikhail Baryshnikov. It's been
>>> a while so I forgot how to spell his last name. I didn't try very hard, and
>>> I got no enwiki result. Google, of course, found the correct spelling, which
>>> I then used on enwiki.
>>>
>>> Since I used to do name searching and matching, this gave me an idea,
>>> which generalizes beyond just names.
>>>
>>> For every article title (and maybe each redirect; we could look into that)
>>> we could generate a phonetic index[1] and store those in a special
>>> Elasticsearch index. (We could look at storing multiple phonetic indexes
>>> for better recall, possibly generated by multiple algorithms; some, like
>>> Double Metaphone, generate multiple indexes by themselves.)
>>>
>>> Then, under certain circumstances (say, zero results and no suggestion
>>> from any other source, or no result with a score above a certain cutoff, or
>>> too few results, etc.), we could make a suggestion and/or show results based
>>> on a matching phonetic index plus some score (say, a mix of page views and
>>> page rank, or whatever scoring we've got going on).
>>>
>>> So, when some doofus (hey, that's me!) comes along and searches for
>>> "borishnakoff" (worse than what I actually searched for), we could correct
>>> to baryshnikov (there's a page with that title) or give Mikhail Baryshnikov as
>>> a result (likely the top scoring item with the same phonetic index in the
>>> title), or something similar.
>>>
>>> Other algorithms exist (and can be devised) for languages other than
>>> English, so the maximally fleshed out version of this would offer a choice
>>> of phonetic indexing algorithms, but I get ahead of myself.
>>>
>>> Has anyone looked into this kind of phonetic indexing for enwiki,
>>> Wikipedia in general, or other Wikimedia projects before?
>>>
>>> I have some additional thoughts on how to test the effectiveness of
>>> phonetic indexing on zero results for enwiki without having to fully
>>> implement everything if the index sounds like something we could afford to
>>> build.
>>>
>>> Thoughts?
>>>
>>> - Trey
>>>
>>> [1] https://en.wikipedia.org/wiki/Phonetic_algorithm - Briefly, as an
>>> example, you drop non-initial vowels and duplicate letters, and collapse
>>> letters that tend to sound alike, while taking into account orthographic
>>> conventions like sh, ch, th, initial kn- or pt-, etc. So both baryshnikov
>>> and borishnakoff are likely to come out something like BRXNGV.
>>>
>>> Trey Jones
>>> Software Engineer, Discovery
>>> Wikimedia Foundation
>>>
>>> _______________________________________________
>>> discovery mailing list
>>> [email protected]
>>> https://lists.wikimedia.org/mailman/listinfo/discovery
>>>
>
> --
> Deb Tankersley
> Product Manager, Discovery
> Wikimedia Foundation
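
To make the footnote in Trey's message a bit more concrete, here is a rough Python sketch of the idea. Everything in it is an assumption made for illustration: the function names (toy_phonetic_key, build_index, suggest), the letter groupings, and the in-memory dictionary standing in for a real search index are all hypothetical. It is not Soundex, Metaphone, or Double Metaphone, and the key it produces for the example name is not the same string as the illustrative one in the footnote. It only shows that the two spellings can be made to collide, and how a fallback could be gated on zero results and on Kevin's limit of 4 or fewer tokens.

```python
import re
from collections import defaultdict

# Letters dropped when they are not word-initial (vowels plus h, w, y).
DROPPED = set("aeiouhwy")

# Letters that tend to sound alike share one code. This grouping is an
# illustrative assumption, not the table of any published algorithm.
CODES = {}
for letters, code in [("bp", "B"), ("fv", "V"), ("cgkq", "K"), ("dt", "T"),
                      ("sz", "S"), ("j", "J"), ("l", "L"), ("m", "M"),
                      ("n", "N"), ("r", "R"), ("x", "KS"), ("X", "X")]:
    for letter in letters:
        CODES[letter] = code


def toy_phonetic_key(word):
    """Return a crude phonetic key for a single word (toy algorithm)."""
    s = re.sub(r"[^a-z]", "", word.lower())
    # Orthographic conventions: silent first letter in initial kn- / pt-.
    s = re.sub(r"^(kn|pt)", lambda m: m.group(1)[1], s)
    # Common digraphs; "X" is an internal marker for the sh/ch sound.
    s = s.replace("ph", "f").replace("sh", "X").replace("ch", "X")
    out = []
    prev = None
    for i, ch in enumerate(s):
        if ch in DROPPED:
            if i == 0:
                out.append(ch.upper())  # keep a word-initial vowel
            prev = None  # a dropped letter separates repeated codes
            continue
        code = CODES[ch]
        if code != prev:  # collapse directly adjacent duplicates
            out.append(code)
        prev = code
    return "".join(out)


def build_index(titles):
    """Map each phonetic key to the titles containing a matching word."""
    index = defaultdict(list)
    for title in titles:
        for token in title.split():
            key = toy_phonetic_key(token)
            if key:
                index[key].append(title)
    return index


def suggest(query, index, exact_hits, max_tokens=4):
    """Offer phonetic matches only for short queries with zero results."""
    tokens = query.split()
    if exact_hits or len(tokens) > max_tokens:
        return []
    matches = []
    for token in tokens:
        for title in index.get(toy_phonetic_key(token), []):
            if title not in matches:
                matches.append(title)
    return matches


titles = ["Mikhail Baryshnikov", "Baryshnikov", "Boris Johnson"]
index = build_index(titles)
print(toy_phonetic_key("baryshnikov"), toy_phonetic_key("borishnakoff"))
print(suggest("borishnakoff", index, exact_hits=0))
```

The sketch returns matches in insertion order. Ranking them by page views or some other score, as suggested in the thread, would be a separate step, and a real deployment would presumably lean on an existing phonetic analyzer instead of a hand-rolled key.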

--
Oliver Keyes
Count Logula
Wikimedia Foundation
