Guys,

Results for language detection by Google langdetect are very poor. After training a Dutch profile with 16 GB of (almost) perfect Dutch, there was no situation in which all perfect sentences were classified as Dutch at more than 80%.
I consider those poor results. I will have to think about that; longer n-grams might be needed. In several tests, Tika is reported as faster but not better than the Google detection. That is probably good enough for LT purposes, but not for purposes like separating old-fashioned Dutch from current Dutch, Frisian, German and other relatively close languages/dialects.

Ruud

On 17-11-12 19:26, Ruud Baars wrote:
> Thanks, it helps a little.
>
> My problem is the poor quality of detection for Dutch, maybe because of
> bad training.
>
> Training with better data than Wikipedia would probably help. A Wiki is
> focussed on non-daily objects, lots of them abroad or special. That is
> why a Wiki is bad training material right from the start.
>
> So I am curious how it is trained and used.
>
> Ruud
>
> On 17-11-12 18:05, Susana Sotelo Docio wrote:
>> Ruud Baars wrote:
>>> Hi, I tried to read the documentation, but that is very technical. Not a
>>> word about what it is really able to do, and how it is trained.
>>>
>>> Would you know where I could find some info at a non-programmer level?
>>> I need to find out the quality of distinction possible between
>>> old-fashioned Dutch, German, Afrikaans, Frisian etc., for better
>>> filtering of a corpus.
>> Hi Ruud,
>>
>> in this article you can find an explanation about the inner functioning of
>> the Tika Language Identifier.
>>
>> http://www.ibm.com/developerworks/opensource/tutorials/os-apache-tika/section6.html
>>
>> AFAIK, most language identification tools are based on the algorithm
>> described in this paper:
>>
>> William B. Cavnar, John M. Trenkle: N-Gram-Based Text Categorization
>> http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.53.9367
>>
>> Hope this helps. :)
>>

_______________________________________________
Languagetool-devel mailing list
Languagetool-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/languagetool-devel