It's good to see this.

Pine

On Wed, Jul 27, 2016 at 1:31 PM, Deborah Tankersley < [email protected]> wrote:

> Using language detection to search the right Wikipedia
>
> Wikipedia readers speak many languages, so it’s not a surprise that
> sometimes they search for phrases not in the language of the wiki that
> they’re currently reading. This, unfortunately, can lead to poor search
> results. A recent survey we completed on English Wikipedia identified
> searches done in 40 different languages
> <https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/TextCat_Re-optimization_for_enwiki#Other_languages_searched_on_enwiki>
> [1]!
>
> The Wikimedia Discovery department
> <http://www.mediawiki.org/wiki/Wikimedia_Discovery> [2] wants to help
> people easily find what they are looking for. In order to do this, the
> Discovery Search team is rolling out new language identification software
> to the Wikipedia search engine.
>
> This new software will detect when a search is unsuccessful but appears
> to be in a different language. When this happens, the search results page
> will include results from the Wikipedia of the automatically detected
> language. These new cross-wiki results will be displayed along with the
> local-wiki results, if there are any. We’ve recently enabled the language
> identification and search results for the English, French, German, Italian,
> and Spanish-language Wikipedias.
>
> The next group of Wikipedias to have language detection enabled will
> include Indonesian, Japanese, Portuguese, and Russian
> <http://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/TextCat_Optimization_for_ptwiki_ruwiki_jawiki_and_idwiki>
> [3]. We are investigating ways to bring language detection to more
> Wikipedias and to other Wikimedia projects.
>
> The Search team has other language detection ideas and plans in the works.
> We’re thinking about ways to improve language detection with smarter
> measures of confidence
> <https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/TextCat_and_Confidence>
> [4].
> We are also exploring detection of searches meant for one character set
> but typed with a keyboard set to another. Early experiments with English
> and Russian
> <https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Typing_on_the_Wrong_Keyboard%E2%80%94Russian_and_English>
> [5] are promising!
>
> You can find technical details about our new language detection module
> (TextCat) on MediaWiki.org <https://www.mediawiki.org/wiki/TextCat> [6].
> PHP <https://github.com/wikimedia/wikimedia-textcat> [7] and updated Perl
> <https://github.com/Trey314159/TextCat> [8] libraries are also available,
> and the libraries include language models for dozens of languages.
>
> You can also test the language detection using our online demo
> <https://tools.wmflabs.org/textcatdemo/> [9]. The demo lets you try all
> the different language models on your own text. It also includes tutorials
> and lots of additional information about TextCat’s internal workings.
>
> Let’s get searching - now with language detection and better results! You
> can read the blog post
> <https://blog.wikimedia.org/2016/07/27/wikipedia-language-search/> [10],
> and more detailed information is here
> <https://commons.wikimedia.org/wiki/File:Wikipedia_Seeks_to_Speak_Your_Language.pdf>
> [11].
>
> *Here are some nice screenshots of what it looked like before we added in
> the language detection...[12]*
>
> *and after we added in the language detection for a Russian query on
> English Wikipedia [13]:*
>
> *Thanks for reading - from the Discovery Search Team Gnomes!*
>
> [1] https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/TextCat_Re-optimization_for_enwiki#Other_languages_searched_on_enwiki
> [2] http://www.mediawiki.org/wiki/Wikimedia_Discovery
> [3] http://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/TextCat_Optimization_for_ptwiki_ruwiki_jawiki_and_idwiki
> [4] https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/TextCat_and_Confidence
> [5] https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Typing_on_the_Wrong_Keyboard%E2%80%94Russian_and_English
> [6] https://www.mediawiki.org/wiki/TextCat
> [7] https://github.com/wikimedia/wikimedia-textcat
> [8] https://github.com/Trey314159/TextCat
> [9] https://tools.wmflabs.org/textcatdemo/
> [10] https://blog.wikimedia.org/2016/07/27/wikipedia-language-search/
> [11] https://commons.wikimedia.org/wiki/File:Wikipedia_Seeks_to_Speak_Your_Language.pdf
> [12] https://commons.wikimedia.org/wiki/File%3AExisting-search_no-textcat.png
> [13] https://commons.wikimedia.org/wiki/File%3ANew-search_with-textcat.png
>
> --
> Deb Tankersley
> Product Manager, Discovery
> IRC: debt
> Wikimedia Foundation
>
> _______________________________________________
> discovery mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/discovery
