Guys,

Results for language detection by Google langdetect are very poor. After 
training a Dutch profile on 16 GB of (almost) perfect Dutch, 
there was no situation where all perfect sentences were classified as 
Dutch for more than 80%.

I consider those poor results. I will have to think about that; longer 
n-grams might be needed.
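
To make the n-gram length idea concrete, here is a rough sketch of the 
rank-order ("out-of-place") method from the Cavnar & Trenkle paper cited 
further down in this thread. It is only an illustration, not the code of 
langdetect or Tika; the function names, the max_n and top_k parameters 
and the "_" word padding are my own assumptions.

```python
from collections import Counter


def ngram_profile(text, max_n=3, top_k=300):
    """Map the top_k most frequent character n-grams (length 1..max_n) to their rank."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # mark word boundaries (assumed padding symbol)
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    ranked = [gram for gram, _ in counts.most_common(top_k)]
    return {gram: rank for rank, gram in enumerate(ranked)}


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from the language profile get a fixed penalty."""
    max_penalty = len(lang_profile)
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else max_penalty
        for gram, rank in doc_profile.items()
    )


def detect(text, lang_profiles, max_n=3):
    """Return the language whose profile is closest to the profile of the text."""
    doc = ngram_profile(text, max_n=max_n)
    return min(lang_profiles, key=lambda lang: out_of_place(doc, lang_profiles[lang]))
```

Raising max_n (for both the language profiles and the text) is the knob 
I mean by "longer n-grams"; whether that really separates close languages 
better would have to be measured.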

In several tests Tika is reported as faster but not better than the 
Google detection.

Probably good enough for LT purposes, but not for purposes like 
separating old-fashioned Dutch from current Dutch, Frisian, German and 
other relatively close languages/dialects.



Ruud



On 17-11-12 19:26, Ruud Baars wrote:
> Thanks, it helps a little.
>
> My problem is the poor quality of detection for Dutch, maybe because of
> bad training.
>
> Training with better data than Wikipedia would probably help. A wiki is
> focused on non-everyday subjects, lots of them foreign or specialized. That is
> why a wiki is bad training material right from the start.
>
> So I am curious how it is trained and used.
>
> Ruud
>
> On 17-11-12 18:05, Susana Sotelo Docio wrote:
>> Ruud Baars wrote:
>>> Hi, I tried to read the documentation, but it is very technical. Not a
>>> word about what it is really able to do, or how it is trained.
>>>
>>> Would you know where I could find some info at a non-programmer level?
>>> I need to find out how well it can distinguish between
>>> old-fashioned Dutch, German, Afrikaans, Frisian etc., for better
>>> filtering of a corpus.
>> Hi Ruud,
>>
>> in this article you can find an explanation about the inner functioning of
>> the Tika Language Identifier.
>>
>> http://www.ibm.com/developerworks/opensource/tutorials/os-apache-tika/section6.html
>>
>> AFAIK, most language identification tools are based on the algorithm
>> described in this paper:
>>
>> William B. Cavnar, John M. Trenkle: N-Gram-Based Text Categorization
>> http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.53.9367
>>
>> Hope this helps. :)
>>
>
> ------------------------------------------------------------------------------
> Monitor your physical, virtual and cloud infrastructure from a single
> web console. Get in-depth insight into apps, servers, databases, vmware,
> SAP, cloud infrastructure, etc. Download 30-day Free Trial.
> Pricing starts from $795 for 25 servers or applications!
> http://p.sf.net/sfu/zoho_dev2dev_nov
> _______________________________________________
> Languagetool-devel mailing list
> Languagetool-devel@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/languagetool-devel

