Greetings language nerds,

I've finished building a balanced 21-language corpus (200 queries per language) of relatively clean queries for use in evaluating language identification models. The 21 languages were chosen based on query volume across wikis in those languages. I've also evaluated our current version of TextCat against this corpus, once using only the 21 known languages and once using all 59 languages I have models for.
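
For anyone who wants to score their own identifier against a balanced corpus like this one, here is a minimal sketch of the bookkeeping. It assumes the results are available as (gold language, predicted language) pairs. The function names, the language codes, and the choice of precision, recall, and F1 are my own illustrative assumptions and not necessarily what the write-up uses, and nothing here calls TextCat itself.

```python
from collections import Counter


def per_language_scores(pairs):
    """Return {lang: (precision, recall, f1)} from (gold, predicted) pairs.

    Hypothetical helper for illustration; the metric choice is an assumption.
    """
    true_pos = Counter()    # queries where the prediction matched the gold label
    gold_count = Counter()  # queries per gold language (200 each in a balanced set)
    pred_count = Counter()  # how often each language was predicted
    for gold, pred in pairs:
        gold_count[gold] += 1
        pred_count[pred] += 1
        if gold == pred:
            true_pos[gold] += 1

    scores = {}
    for lang in sorted(set(gold_count) | set(pred_count)):
        precision = true_pos[lang] / pred_count[lang] if pred_count[lang] else 0.0
        recall = true_pos[lang] / gold_count[lang] if gold_count[lang] else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[lang] = (precision, recall, f1)
    return scores


def accuracy(pairs):
    """Fraction of queries whose predicted language equals the gold language."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(1 for gold, pred in pairs if gold == pred) / len(pairs)


if __name__ == "__main__":
    # Made-up toy results, not data from the corpus.
    toy = [("en", "en"), ("en", "id"), ("id", "id"), ("id", "id")]
    for lang, (p, r, f) in per_language_scores(toy).items():
        print(f"{lang}: precision={p:.3f} recall={r:.3f} f1={f:.3f}")
    print(f"accuracy={accuracy(toy):.3f}")
```

Because the corpus is balanced, overall accuracy equals the unweighted mean of per-language recall, so no single high-volume language dominates the headline number.
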
The 21 languages have pretty good models, because they had lots of query volume to build on. The full set of 59 is a bit more dodgy, especially Igbo, whose training data is known to contain a lot of English. Indonesian is the most unexpectedly poor performer of the bunch (most other poor performance is across language or script families and so is expected).

The best model size among those tested (500 to 10K n-grams) was the full 10,000! However, performance at the 3,000 n-gram model size (what we've been using for A/B tests) was only a few percentage points worse.

Full write-up with lots more details here:
https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Balanced_Language_Identification_Evaluation_Set_for_Queries

I'll commit models for the rest of these 21 languages after verifying that they won't mess up our A/B tests.

Cheers,
Trey

Trey Jones
Software Engineer, Discovery
Wikimedia Foundation
_______________________________________________
discovery mailing list
https://lists.wikimedia.org/mailman/listinfo/discovery
