This message announces the publication of possibly the best open source language identification tool available... ever.
https://launchpad.net/langmatch

Langmatch is a Python command-line tool that guesses the language of a text string of any length (of course, longer is better). Langmatch uses language maps (fingerprints or models) from a variety of popular open source tools (such as mguesser and libtextcat), or users can use langmatch to create their own maps of any n-gram length. It can use the 451 3-gram models from the Python NLTK.

Note that the NLTK language maps have not yet been uploaded to the launchpad.net repository, although the plan is to do so. You can obtain them from the NLTK corpus (nltk_data/corpora/langid), or I'm happy to distribute them to you directly under the GNU GPL v3.

The Python code has been optimized for performance. Maps of up to 7-grams run amazingly fast.

Langmatch seems infinitely configurable. You can run langmatch from the command line or import it directly into your own program. End-of-line or whole-document processing, you name it. It can report its raw scores, or just the voted result. You can pick which maps to use for analysis without moving the maps in or out of the installation folder.

Also, the author informally brands his file-opening code "lib-open-my-god-damn-file.py". You get:

* stdin/stdout
* regular paths and filenames on Windows/Posix
* URLs: read files directly via http, ftp, ftps, etc.
* custom URI types
* transparent decompression on local files
* iteration of directories for input
* substitution of default filenames for output
* much, much more

Langmatch has been tested on Linux with Python 2.6, 2.7 and 3, and on MS Windows with Python 2.7 (it should also work on 2.6 and 3 for Windows and Mac). It is fully Unicode-aware. It is distributed under the GNU GPL v3 license.

I hope some of you on this list will enjoy this tool.

Tom

_______________________________________________
Moses-support mailing list
http://mailman.mit.edu/mailman/listinfo/moses-support
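
For readers unfamiliar with n-gram language maps of the kind mentioned in the announcement, here is a minimal sketch of the general idea. It is not langmatch's code or API. It illustrates the rank-order ("out-of-place") comparison from Cavnar and Trenkle's text categorization method, which libtextcat implements. The function names (build_profile, out_of_place, guess_language) and the tiny training sentences are made up for illustration; real maps are built from far more text.

```python
from collections import Counter


def build_profile(text, n=3, top=300):
    """Return the `top` most frequent character n-grams, most frequent first."""
    # Normalize whitespace and pad so word boundaries become part of the n-grams.
    padded = " " + " ".join(text.lower().split()) + " "
    counts = Counter(padded[i:i + n] for i in range(len(padded) - n + 1))
    # Sort by descending count, then alphabetically, so the ranking is deterministic.
    ranked = sorted(counts, key=lambda gram: (-counts[gram], gram))
    return ranked[:top]


def out_of_place(sample, reference):
    """Sum of rank differences; n-grams absent from the reference get a fixed penalty."""
    ref_rank = {gram: rank for rank, gram in enumerate(reference)}
    penalty = len(reference)
    return sum(
        abs(rank - ref_rank[gram]) if gram in ref_rank else penalty
        for rank, gram in enumerate(sample)
    )


def guess_language(text, maps, n=3):
    """Return the key of the map whose profile is closest to the text's profile."""
    sample = build_profile(text, n)
    return min(maps, key=lambda lang: out_of_place(sample, maps[lang]))


# Hypothetical toy training data, far too small for real use.
training = {
    "english": "the quick brown fox jumps over the lazy dog "
               "and then the dog sleeps in the sun",
    "german": "der schnelle braune fuchs springt über den faulen hund "
              "und dann schläft der hund in der sonne",
}
maps = {lang: build_profile(text) for lang, text in training.items()}

print(guess_language("the dog and the fox", maps))
print(guess_language("der hund und der fuchs", maps))
```

With these toy maps the two calls should pick "english" and "german" respectively. Longer n-grams and larger training texts generally give more reliable guesses, which is why longer input strings work better.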
