It's good to see this.

Pine

On Wed, Jul 27, 2016 at 1:31 PM, Deborah Tankersley <
[email protected]> wrote:

> Using language detection to search the right Wikipedia
>
> Wikipedia readers speak many languages, so it’s not a surprise that
> sometimes they search for phrases not in the language of the wiki that
> they’re currently reading. This, unfortunately, can lead to poor search
> results. A recent survey we completed on English Wikipedia identified
> searches done in 40 different languages
> <https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/TextCat_Re-optimization_for_enwiki#Other_languages_searched_on_enwiki>
>  [1]!
>
> The Wikimedia Discovery department
> <http://www.mediawiki.org/wiki/Wikimedia_Discovery> [2] wants to help
> people easily find what they are looking for. In order to do this, the
> Discovery Search team is rolling out new language identification software
> to the Wikipedia search engine.
>
> This new software will detect when a search is unsuccessful but appears
> to be in a different language. When this happens, the search results page
> will include results from the Wikipedia of the automatically detected
> language. These new cross-wiki results will be displayed along with the
> local-wiki results, if there are any. We’ve recently enabled language
> identification and cross-wiki search results for the English, French,
> German, Italian, and Spanish-language Wikipedias.
>
> The next group of Wikipedias to have language detection enabled will
> include Indonesian, Japanese, Portuguese, and Russian
> <http://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/TextCat_Optimization_for_ptwiki_ruwiki_jawiki_and_idwiki>
>  [3]. We are investigating ways to bring language detection to more
> Wikipedias and to other Wikimedia projects.
>
> The Search team has other language detection ideas and plans in the works.
> We’re thinking about ways to improve language detection with smarter
> measures of confidence
> <https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/TextCat_and_Confidence>
>  [4].
> We are also exploring how to detect searches typed with the wrong keyboard
> layout, which produces text in one character set when another was
> intended. Early experiments with English and Russian
> <https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Typing_on_the_Wrong_Keyboard%E2%80%94Russian_and_English>
>  [5]
> are promising!
>
> You can find technical details about our new language detection module
> (TextCat) on MediaWiki.org <https://www.mediawiki.org/wiki/TextCat> [6].
> PHP <https://github.com/wikimedia/wikimedia-textcat> [7] and updated Perl
> <https://github.com/Trey314159/TextCat> [8] libraries are also available,
> and both include language models for dozens of languages.
>
> You can also test the language detection using our online demo
> <https://tools.wmflabs.org/textcatdemo/> [9]. The demo lets you try all
> the different language models on your own text. It also includes tutorials
> and lots of additional information about TextCat’s internal workings.
>
> Let’s get searching - now with language detection and better results! You
> can read the blog post
> <https://blog.wikimedia.org/2016/07/27/wikipedia-language-search/> [10]
> and more detailed information is here
> <https://commons.wikimedia.org/wiki/File:Wikipedia_Seeks_to_Speak_Your_Language.pdf>
>  [11].
>
> *Here are some screenshots of what search looked like before we added
> language detection... [12]*
>
>
>
> *and after we added in the language detection for a Russian query on
> English Wikipedia [13]:*
>
>
>
>
> *Thanks for reading - from the Discovery Search Team Gnomes!*
>
>
> [1]
> https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/TextCat_Re-optimization_for_enwiki#Other_languages_searched_on_enwiki
> [2] http://www.mediawiki.org/wiki/Wikimedia_Discovery
> [3]
> http://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/TextCat_Optimization_for_ptwiki_ruwiki_jawiki_and_idwiki
> [4]
> https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/TextCat_and_Confidence
> [5]
> https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Typing_on_the_Wrong_Keyboard%E2%80%94Russian_and_English
> [6] https://www.mediawiki.org/wiki/TextCat
> [7] https://github.com/wikimedia/wikimedia-textcat
> [8] https://github.com/Trey314159/TextCat
> [9] https://tools.wmflabs.org/textcatdemo/
> [10] https://blog.wikimedia.org/2016/07/27/wikipedia-language-search/
> [11]
> https://commons.wikimedia.org/wiki/File:Wikipedia_Seeks_to_Speak_Your_Language.pdf
> [12]
> https://commons.wikimedia.org/wiki/File%3AExisting-search_no-textcat.png
> [13] https://commons.wikimedia.org/wiki/File%3ANew-search_with-textcat.png
>
> --
> Deb Tankersley
> Product Manager, Discovery
> IRC: debt
> Wikimedia Foundation
>
> _______________________________________________
> discovery mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/discovery
>
>
