Greetings language nerds,

I've completed a 21-language balanced corpus (200 relatively clean
queries per language) for evaluating language identification models.
The 21 languages were chosen based on query volume across wikis in
those languages. I've also evaluated our current version of TextCat
against this corpus, both with the known 21 languages and with all 59
languages I have models for.

The 21 languages have pretty good models, because there was plenty of
query volume to build them from. The full set of 59 is a bit more
dodgy, especially Igbo, which is known to have a lot of English in its
training data.

Indonesian is the most unexpectedly poor performer of the bunch (most
other poor performance occurs across language or script families, and
so is expected).

The best model size among those tested (500 to 10,000 n-grams) was the
full 10,000! However, performance at the 3,000-n-gram model size (what
we've been using for A/B tests) was only a few percentage points worse.
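For anyone curious what "model size" means here: TextCat follows the classic rank-order character n-gram approach, where each language model is just the top-N most frequent n-grams from training data, and classification picks the model with the smallest rank-distance to the query's own profile. Below is a minimal, simplified sketch of that idea in Python — not our actual implementation, and the function names and toy training strings are made up for illustration.

```python
from collections import Counter

def ngram_profile(text, max_n=5, size=3000):
    """Build a ranked character n-gram profile from text.

    `size` caps the profile at the top-ranked n-grams -- this is the
    "model size" discussed above (e.g., 3,000 vs. 10,000).
    """
    counts = Counter()
    for token in text.lower().split():
        padded = f"_{token}_"  # mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    ranked = [gram for gram, _ in counts.most_common(size)]
    return {gram: rank for rank, gram in enumerate(ranked)}

def out_of_place(doc_profile, model_profile):
    """Sum of rank differences; n-grams missing from the model get
    the maximum penalty."""
    max_penalty = len(model_profile)
    return sum(
        abs(rank - model_profile.get(gram, max_penalty))
        for gram, rank in doc_profile.items()
    )

def identify(text, models):
    """Return the language whose model is closest to the text."""
    profile = ngram_profile(text)
    return min(models, key=lambda lang: out_of_place(profile, models[lang]))
```

With toy models built from a sentence or two of English and German, `identify("the lazy dog", models)` lands on the English model because its top-ranked n-grams sit at nearly the same ranks in both profiles. The only knob relevant to the results above is `size`: a 10,000-n-gram profile captures rarer sequences than a 3,000-n-gram one, at the cost of a bigger model.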

Full write-up with lots more details here:
https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Balanced_Language_Identification_Evaluation_Set_for_Queries

I'll commit models for the rest of these 21 languages after verifying that
they won't mess up our A/B tests.

Cheers,
—Trey

Trey Jones
Software Engineer, Discovery
Wikimedia Foundation
_______________________________________________
discovery mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/discovery
