Thanks! I was looking for exactly this kind of tool a few days ago, and in the end I settled on Google's CLD (Compact Language Detector), which was extracted from Chrome's sources: http://code.google.com/p/chromium-compact-language-detector/
It worked very well. If I have the time, I'll do a comparison between that and Langmatch.

Thanks and best regards,
Tamas

On Wed, Jan 25, 2012 at 1:47 PM, Tom Hoar <[email protected]> wrote:
> This message announces the publication of possibly the best open source
> language identification tool available... ever.
>
> https://launchpad.net/langmatch
>
> Langmatch is a Python command-line tool that guesses the language of a
> text string of any length (of course longer is better). Langmatch uses
> language maps (fingerprints or models) from a variety of popular open
> source tools (such as mguesser and libtextcat) or users can use
> langmatch to create their own maps of any n-gram length. It can use the
> 451 3-gram models from the Python NLTK. Note that the NLTK language maps
> have not yet been uploaded to the launchpad.net repository, although the
> plan is to do so. You can obtain them from the NLTK corpus
> (nltk_data/corpora/langid), or I'm happy to distribute them to you directly
> under GNU GPL3. The Python code has been optimized for
> performance. Maps of up to 7 grams run amazingly fast.
>
> Langmatch seems infinitely configurable. You can run langmatch from the
> command line or import it directly into your own program. End-of-line,
> whole documents processing, you name it. It can report its raw scores,
> or just the voted result. You can pick which maps to use for analysis
> without moving the maps in/out of the installation folder. Also, the
> author informally brands his file-opening code
> "lib-open-my-god-damn-file.py". You get:
> * stdin/stdout
> * regular paths and filenames on Windows/Posix
> * URLs: read files directly via http, ftp, ftps, etc
> * custom URI types
> * transparent decompression on local files
> * iteration of directories for input
> * substitution of default filenames for output
> * much, much more
>
> Langmatch has been tested on Linux Python 2.6, 2.7, 3 and MS Windows
> Python 2.7 (should work on 2.6 & 3 for Windows and Mac). It is fully
> Unicode-aware. It is distributed under the GNU GPL v3 license.
>
> I hope some of you on this list will enjoy this tool.
>
> Tom
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support
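
For anyone on the list wondering what the n-gram "language maps" in the announcement are for, here is a rough sketch of the rank-order profile matching described by Cavnar and Trenkle, which is the approach libtextcat is based on. This is not Langmatch's code and I have not checked how Langmatch scores its maps; the function names (`ngram_profile`, `out_of_place`, `guess_language`) and the two toy training sentences are made up for illustration. Real maps are built from far more text.

```python
from collections import Counter


def ngram_profile(text, n=3, top=300):
    """Return the `top` most frequent character n-grams, most frequent first."""
    # Normalize case and whitespace, and pad so word boundaries form n-grams too.
    padded = " " + " ".join(text.lower().split()) + " "
    counts = Counter(padded[i:i + n] for i in range(len(padded) - n + 1))
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top]]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams absent from the map get a fixed penalty."""
    rank = {gram: i for i, gram in enumerate(lang_profile)}
    penalty = len(lang_profile)
    return sum(abs(i - rank[gram]) if gram in rank else penalty
               for i, gram in enumerate(doc_profile))


def guess_language(text, maps, n=3):
    """Return the key of the language map closest to the text's profile."""
    doc_profile = ngram_profile(text, n)
    return min(maps, key=lambda lang: out_of_place(doc_profile, maps[lang]))


# Toy training data (hypothetical, far too small for real use).
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and then "
          "the dog runs after the fox through the woods",
    "de": "der schnelle braune fuchs springt über den faulen hund und dann "
          "läuft der hund dem fuchs durch den wald nach",
}
MAPS = {lang: ngram_profile(sample) for lang, sample in SAMPLES.items()}

if __name__ == "__main__":
    print(guess_language("the dog and the fox", MAPS))
```

The lowest total rank distance wins, which is why longer input tends to give better guesses: a short string yields few n-grams, so a handful of accidental matches can dominate the score.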
