If anyone is interested, the write-ups for eswiki, itwiki, and dewiki are done and available on the same page:
https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/TextCat_Optimization_for_frwiki_eswiki_itwiki_and_dewiki

On Mon, Apr 18, 2016 at 6:50 PM, Trey Jones <[email protected]> wrote:

> Hi Everyone,
>
> I've just finished my write-up for optimizing the languages that could
> eventually be used for language detection on French Wikipedia. (Spanish,
> Italian, and German are still to come.)
>
> The full write-up
> <https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/TextCat_Optimization_for_frwiki_eswiki_itwiki_and_dewiki>
> gives details on corpus creation and clean-up, performance stats, and more.
>
> Briefly, about 15% of "low-performing" queries (those with < 3 results)
> are easily filtered junk, and 65% of the remainder are not in an
> identifiable language (e.g., names, acronyms, more junk, etc.).
>
> Based on a sample of 682 poor-performing queries on frwiki that are in
> some language, about 70% are in French, 10-15% are in English, about
> 7-12% are in Arabic, fewer than 3% each are in Portuguese, German, and
> Spanish, and a handful of other languages are present.
>
> Because of the relatively low percentage of low-performing queries that
> are relevant, we will still need to run an A/B test before discussing
> deploying this to frwiki. An A/B test on enwiki
> <https://phabricator.wikimedia.org/T121542> is in the works at the moment.
>
> The optimal settings for frwiki, based on these experiments, would be to
> use the TextCat query-based models for French, English, Arabic, Russian,
> Chinese, Thai, Greek, Armenian, Hebrew, and Korean (fr, en, ar, ru, zh,
> th, el, hy, he, ko), using the default 3000-ngram models.
>
> —Trey
>
> Trey Jones
> Software Engineer, Discovery
> Wikimedia Foundation
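[Editor's note: for readers unfamiliar with the "3000-ngram models" mentioned above, TextCat implements the classic rank-order character n-gram method: each language model is a list of that language's most frequent character n-grams ranked by frequency, and a query is assigned to the language whose ranking is closest to the query's own. The sketch below illustrates the idea only; it is not TextCat's actual code, and the tiny training strings are placeholders, not real model data.]

```python
# Minimal sketch of rank-order character n-gram language identification,
# the technique underlying TextCat. Real models keep the top 3000 n-grams
# per language; the toy "training" strings below are illustrative only.
from collections import Counter


def ngram_profile(text, max_n=5, top=3000):
    """Rank character n-grams (n = 1..max_n) of `text` by frequency.

    Words are padded with '_' so word-initial and word-final n-grams
    are distinguished from word-internal ones.
    """
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    ranked = [g for g, _ in counts.most_common(top)]
    return {g: rank for rank, g in enumerate(ranked)}


def distance(doc_profile, lang_profile):
    """Sum of out-of-place rank differences; an n-gram absent from the
    language model incurs the maximum penalty (the model's size)."""
    max_penalty = len(lang_profile)
    return sum(
        abs(rank - lang_profile.get(g, max_penalty))
        for g, rank in doc_profile.items()
    )


def detect(query, models):
    """Return the language whose model is closest to the query's profile."""
    doc = ngram_profile(query)
    return min(models, key=lambda lang: distance(doc, models[lang]))


# Toy models built from a few function words (placeholders, not real data).
models = {
    "fr": ngram_profile("le la les une des est dans pour avec nous vous"),
    "en": ngram_profile("the and for with this that from have what you"),
}
print(detect("les des une est", models))  # prints: fr
```

A query made of French function words matches the ranked n-grams of the French toy model closely, while most of its n-grams are absent from the English model and incur the maximum penalty, so the French model wins. The production version differs mainly in scale (3000-entry models trained on real query corpora) rather than in mechanism.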
_______________________________________________
discovery mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/discovery
