Marcin Miłkowski wrote:
> On 2013-09-10 10:06, Kumara Bhikkhu wrote:
>> Dear friends,
>>
>> Rules are tested against an amply large sample of Wikipedia articles.
>> That's great. But it also limits the testing on encyclopedic language
>> only. As some errors tend to happen more in conversational writing, I
>> wonder if it's possible to extend the sample to such texts.
>
> Yes, but we need to have access to a fairly big corpus. I know there was
> a freely available blog corpus but only for English.
>
> But in principle, we could have additional corpora added for our testing
> purposes.

What about this (http://www.tatoeba.org/eng/downloads), which has
sentences in many languages (Creative Commons):

$ wget http://tatoeba.org/files/downloads/sentences.csv

Here is how I extract the sentences for one language, for example
Esperanto (language code epo), to check LT:

$ perl -F'\t' -ane 'print "@F[2 .. $#F]" if ($F[1] eq "epo")' \
    sentences.csv > sentences-epo.csv

Here is the number of sentences per language:

$ cut -f 2 sentences.csv | sort | uniq -c | sort -n -r
 366649 eng
 271414 epo
 224456 deu
 208862 fra
 186692 spa
 173735 jpn
 144978 tur
 137318 ita
 123692 por
 120965 rus
  94097 heb
  68252 ber
  54381 pol
  43646 cmn
  35458 hun
  31797 nld
  23404 ukr
  16336 nds
  15520 dan
  15515 fin
  14796 mar
  14262 ara
  13990 swe
  13579 pes
  11429 lat
  10083 tlh
   9867 isl
   9782 jbo
   9286 bul
   9272 lit
   7104 ina
   7089 uig
   6856 ell
   6260 tgl
   6205 srp
   5944 vie
   5369 ces
   5360 nob
   5073 hin
   4563 cat
   4461 ron
   4125 ido
   4103 wuu
   3893 glg
   3772 bel
   3255 yue
   3080 hrv
 ...etc....

I think it covers all languages supported by LT and more.

Dominique

_______________________________________________
Languagetool-devel mailing list
Languagetool-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/languagetool-devel