Hi all,

Maybe you have seen this already, but back in October I evaluated several language detection libraries (CLD, langid.py, fastText). The results are available at http://alexott.blogspot.de/2017/10/evaluating-fasttexts-models-for.html. I have recently updated the post with data for the OpenNLP-based language detector.

The test data I used were collected during my experiments with classification of web pages. There is a link to the data, which you can download (~18Mb).

William Colen at "Thu, 2 Nov 2017 08:55:48 -0200" wrote:
 WC> The Apache OpenNLP library is a machine learning based toolkit for the
 WC> processing of natural language text.

 WC> The Apache OpenNLP team is pleased to announce the release of Language
 WC> Detector Model 1.8.3 for Apache OpenNLP 1.8.3.

 WC> The Language Detector Model can detect 103 languages and outputs ISO 639-3
 WC> codes.

 WC> Apache OpenNLP model and reports are available for download from our model
 WC> download page:
 WC> http://opennlp.apache.org/models.html

 WC> This is the first release of the Language Detector Model. It is compatible
 WC> with Apache OpenNLP 1.8.3 or better.

 WC> It is important to note that this model is trained for and works well with
 WC> longer texts that have at least 2 sentences (or more) from a single
 WC> language.

 WC> More information about this release can be found in the README.txt at:
 WC> https://www.apache.org/dist/opennlp/models/langdetect/1.8.3/README.txt

 WC> Details about this model effectiveness can be found in the following report:
 WC> https://www.apache.org/dist/opennlp/models/langdetect/1.8.3/langdetect-183.bin.report.txt

 WC> --The Apache OpenNLP Team

--
With best wishes, Alex Ott
http://alexott.blogspot.com/
http://alexott.net/
http://alexott-ru.blogspot.com/
Skype: alex.ott