Thanks, it helps a little.

My problem is the poor quality of detection for Dutch, possibly because of 
poor training data.

Training with better data than Wikipedia would probably help. A wiki 
focuses on topics outside everyday life, many of them foreign or 
specialized. That is why a wiki is poor training material right from the 
start.

So I am curious how it is trained and used.
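
For anyone reading along: if the article Susana links below is indeed what 
the detection relies on, the usage side goes through Tika's 
LanguageIdentifier, roughly like this minimal sketch (written against the 
Tika 1.x API as I understand it; the example text is just an illustration):

    import org.apache.tika.language.LanguageIdentifier;

    public class DetectExample {
        public static void main(String[] args) {
            // The identifier compares the character n-gram profile of the
            // input text against the per-language profiles shipped with Tika.
            String text = "Dit is een korte Nederlandse zin.";
            LanguageIdentifier identifier = new LanguageIdentifier(text);

            System.out.println(identifier.getLanguage());         // e.g. "nl"
            System.out.println(identifier.isReasonablyCertain()); // confidence flag
        }
    }

As far as I can tell, the "training" is simply the set of per-language 
n-gram profile files that ship with Tika, so retraining would mean 
rebuilding those profiles from whatever corpus you prefer.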

Ruud

On 17-11-12 18:05, Susana Sotelo Docio wrote:
> Ruud Baars wrote:
>> Hi, I tried to read the documentation, but it is very technical. There is not a
>> word about what it is really capable of, or how it is trained.
>>
>> Would you know where I could find some info on this at a non-programmer level?
>> I need to find out how well it can distinguish between
>> old-fashioned Dutch, German, Afrikaans, Frisian etc., for better
>> filtering of a corpus.
> Hi Ruud,
>
> In this article you can find an explanation of the inner workings of
> the Tika Language Identifier.
>
> http://www.ibm.com/developerworks/opensource/tutorials/os-apache-tika/section6.html
>
> AFAIK, most language identification tools are based on the algorithm
> described in this paper:
>
> William B. Cavnar, John M. Trenkle: N-Gram-Based Text Categorization
> http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.53.9367
>
> Hope this helps. :)
>
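
For completeness, here is how I read the Cavnar & Trenkle approach: 
"training" means counting character n-grams in a sample corpus for each 
language and keeping the most frequent few hundred as a ranked profile; 
detection builds the same kind of profile for the input text and picks the 
language whose profile is closest by the "out-of-place" rank distance. A 
rough, hypothetical sketch of that idea (not the actual Tika code):

    import java.util.*;

    // Minimal Cavnar & Trenkle-style profile: ranked character n-grams (n = 1..3).
    public class NGramProfile {

        private static final int PROFILE_SIZE = 300;
        private final Map<String, Integer> rank = new HashMap<String, Integer>();

        public NGramProfile(String trainingText) {
            // 1. Count all character n-grams in the training text.
            Map<String, Integer> counts = new HashMap<String, Integer>();
            String text = trainingText.toLowerCase();
            for (int n = 1; n <= 3; n++) {
                for (int i = 0; i + n <= text.length(); i++) {
                    String gram = text.substring(i, i + n);
                    Integer c = counts.get(gram);
                    counts.put(gram, c == null ? 1 : c + 1);
                }
            }
            // 2. Sort by frequency and keep the top PROFILE_SIZE n-grams;
            //    the position in the sorted list is the n-gram's rank.
            List<Map.Entry<String, Integer>> sorted =
                    new ArrayList<Map.Entry<String, Integer>>(counts.entrySet());
            Collections.sort(sorted, new Comparator<Map.Entry<String, Integer>>() {
                public int compare(Map.Entry<String, Integer> a, Map.Entry<String, Integer> b) {
                    return b.getValue() - a.getValue();
                }
            });
            for (int i = 0; i < sorted.size() && i < PROFILE_SIZE; i++) {
                rank.put(sorted.get(i).getKey(), i);
            }
        }

        // "Out-of-place" distance: for each n-gram in this profile, add how far
        // its rank is from its rank in the other profile (or a fixed penalty if
        // it is missing there). The language whose profile has the smallest
        // distance to the text's profile wins.
        public int distanceTo(NGramProfile other) {
            int distance = 0;
            for (Map.Entry<String, Integer> e : rank.entrySet()) {
                Integer otherRank = other.rank.get(e.getKey());
                distance += (otherRank == null) ? PROFILE_SIZE : Math.abs(e.getValue() - otherRank);
            }
            return distance;
        }
    }

This also suggests why closely related languages such as Dutch, Afrikaans 
and Frisian are hard to separate: their most frequent n-grams overlap 
heavily, the distances end up close together, and the choice of training 
corpus then matters a great deal for the remaining margin.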

