Marcin Miłkowski wrote:

> On 2013-09-10 10:06, Kumara Bhikkhu wrote:
>> Dear friends,
>>
>> Rules are tested against an amply large sample of Wikipedia articles.
>> That's great, but it also limits the testing to encyclopedic
>> language. As some errors tend to happen more in conversational
>> writing, I wonder if it's possible to extend the sample to such texts.
>
> Yes, but we need to have access to a fairly big corpus. I know there was
> a freely available blog corpus but only for English.
>
> But in principle, we could have additional corpora added for our testing
> purposes.


What about Tatoeba (http://www.tatoeba.org/eng/downloads)? It has
sentences in many languages under a Creative Commons license:

$ wget http://tatoeba.org/files/downloads/sentences.csv

Here is how I extract, for example, the Esperanto sentences (language
code epo) to check them with LT:

$ perl -F'\t' -ane 'print "@F[2 .. $#F]" if ($F[1] eq "epo")' \
    sentences.csv > sentences-epo.csv
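
The same filter can probably be written without Perl, using awk and cut.
This is only a sketch: extract_lang is a hypothetical helper name, and it
assumes sentences.csv has three tab-separated columns (id, language code,
sentence text), as the Perl one-liner above implies:

```shell
# Hypothetical helper: print the sentence text for one language code.
# Assumes tab-separated columns: id, language code, sentence text.
extract_lang() {
    lang="$1"
    file="$2"
    # Keep the rows whose second column matches, then drop id and code.
    awk -F'\t' -v lang="$lang" '$2 == lang' "$file" | cut -f 3-
}

# Example: extract_lang epo sentences.csv > sentences-epo.csv
```

Passing the language code as an argument makes it easy to repeat the
extraction for each language that LT supports.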

Here is the number of sentences per language:

$ cut -f 2 sentences.csv  | sort | uniq -c | sort -n -r
 366649 eng
 271414 epo
 224456 deu
 208862 fra
 186692 spa
 173735 jpn
 144978 tur
 137318 ita
 123692 por
 120965 rus
  94097 heb
  68252 ber
  54381 pol
  43646 cmn
  35458 hun
  31797 nld
  23404 ukr
  16336 nds
  15520 dan
  15515 fin
  14796 mar
  14262 ara
  13990 swe
  13579 pes
  11429 lat
  10083 tlh
   9867 isl
   9782 jbo
   9286 bul
   9272 lit
   7104 ina
   7089 uig
   6856 ell
   6260 tgl
   6205 srp
   5944 vie
   5369 ces
   5360 nob
   5073 hin
   4563 cat
   4461 ron
   4125 ido
   4103 wuu
   3893 glg
   3772 bel
   3255 yue
   3080 hrv
   ...etc....
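
To keep only the languages with enough sentences to be useful for
testing, the counts can be filtered by a threshold. A rough sketch,
where langs_with_min is a hypothetical helper name and the threshold is
whatever we decide is "fairly big":

```shell
# Hypothetical helper: list language codes with at least MIN sentences.
# Assumes the language code is the second tab-separated column.
langs_with_min() {
    min="$1"
    file="$2"
    cut -f 2 "$file" | sort | uniq -c |
        awk -v min="$min" '$1 >= min { print $2 }'
}

# Example: langs_with_min 10000 sentences.csv
```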

I think it covers all languages supported by LT and more.

Dominique

_______________________________________________
Languagetool-devel mailing list
Languagetool-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/languagetool-devel
