Trey, thank you for sharing it. As a Hebrew speaker, I noticed that Hebrew gets ~100%, but that is too easy a task: the model just learns the Hebrew alphabet, since there is no Yiddish (which uses a similar script) to compete with it. So maybe the selected languages shouldn't be the top 20, but should be chosen based on the "top scripts"?
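To make the concern concrete, here is a minimal sketch of a "detector" that looks only at the script and never at the language itself. It is purely illustrative: the `SCRIPT_TO_LANG` table and the function names are hypothetical, and it is not how TextCat works. It assumes each script in the test set maps to exactly one language, which is the situation Hebrew is in when Yiddish is absent.

```python
import unicodedata
from collections import Counter

# Hypothetical table: one candidate language per script, as in a test set
# where Hebrew has no same-script competitor such as Yiddish.
SCRIPT_TO_LANG = {"HEBREW": "he", "GREEK": "el", "THAI": "th"}


def dominant_script(text):
    """Return the most common script among the letters of text, or None."""
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            # Unicode names look like "HEBREW LETTER ALEF";
            # the first word is used here as a rough script label.
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split()[0]] += 1
    return counts.most_common(1)[0][0] if counts else None


def guess_by_script(text):
    """Guess a language from the script alone; None if the script is unknown."""
    return SCRIPT_TO_LANG.get(dominant_script(text))
```

With this table, a Yiddish query such as "אַ שיינעם דאַנק" also comes back as "he", so a high Hebrew score on a set without Yiddish says little about how well the model separates languages that share a script.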
By the way, is there a measure of how well language detection based on the text/query works compared to models based on user metadata (such as location/IP)? (I think the language team, when they worked on ULS, built the UI based on such heuristics.)

On Tue, Mar 1, 2016 at 6:15 PM, Trey Jones <[email protected]> wrote:
> Greetings language nerds,
>
> I've completed the creation of a 21-language balanced (i.e., 200 each)
> corpus of relatively clean queries for use in evaluating language
> identification model testing. The 21 languages were chosen based on query
> volume across wikis in those languages. I've also evaluated our current
> version of TextCat against this corpus, using the known 21 languages, and
> all 59 languages I have models for.
>
> The 21 languages have pretty good models, because they had lots of query
> volume to be built on. The full set of 59 is a bit more dodgy, esp. Igbo,
> which is known to have a lot of English in the training data.
>
> Indonesian is the most unexpectedly poor performing of the bunch (most
> other poor performance is across language or script families and so is
> expected).
>
> The best model size among those tested (500 to 10K), was the full 10,000!
> However performance at the 3,000 ngram model size (what we've been using
> for A/B tests) was only a few percentage points worse.
>
> Full write up with lots more details here:
>
> https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Balanced_Language_Identification_Evaluation_Set_for_Queries
>
> I'll commit models for the rest of these 21 languages after verifying that
> they won't mess up our A/B tests.
>
> Cheers,
> —Trey
>
> Trey Jones
> Software Engineer, Discovery
> Wikimedia Foundation
>
> _______________________________________________
> discovery mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/discovery
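For anyone following along who is unsure what "model size" refers to above, here is a rough sketch. It assumes TextCat follows the classic rank-order character n-gram approach (Cavnar and Trenkle), where a model is the list of the most frequent n-grams and the size is where that list is cut off. The function names, the padding scheme, and the default values are my own assumptions, not TextCat's actual code.

```python
from collections import Counter


def ngram_profile(text, max_n=3, size=3000):
    """Return the `size` most frequent character n-grams (1..max_n), best first."""
    # Mark word boundaries with "_" so word-initial and word-final n-grams count.
    padded = "_" + "_".join(text.lower().split()) + "_"
    counts = Counter(
        padded[i:i + n]
        for n in range(1, max_n + 1)
        for i in range(len(padded) - n + 1)
    )
    # "Model size" is simply where this ranked list is truncated.
    return [gram for gram, _ in counts.most_common(size)]


def out_of_place(query_profile, model_profile):
    """Sum of rank differences; n-grams missing from the model get a fixed penalty."""
    rank = {gram: r for r, gram in enumerate(model_profile)}
    penalty = len(model_profile)
    return sum(
        abs(r - rank[gram]) if gram in rank else penalty
        for r, gram in enumerate(query_profile)
    )


def identify(query, models):
    """Pick the language whose model profile is closest to the query's profile."""
    query_profile = ngram_profile(query)
    return min(models, key=lambda lang: out_of_place(query_profile, models[lang]))
```

In this sketch `models` would be a dict from language code to a profile built with `ngram_profile` at the chosen size, so comparing 500 against 10,000 just means truncating each ranked list at a different point.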
