[Moses-support] New cross-platform language guessing/ID tool

Tom Hoar Wed, 25 Jan 2012 04:48:25 -0800

 This message announces the publication of what is possibly the best open 
 source language identification tool available... ever.


 https://launchpad.net/langmatch

 Langmatch is a Python command-line tool that guesses the language of a 
 text string of any length (longer text gives better results). It uses 
 language maps (fingerprints or models) from a variety of popular open 
 source tools (such as mguesser and libtextcat) or users can use 
 langmatch to create their own maps of any n-gram length. It can use the 
 451 3-gram models from the Python NLTK. Note that the NLTK language maps 
 have not yet been uploaded to the launchpad.net repository, although the 
 plan is to do so. You can obtain them from the NLTK corpus 
 (nltk_data/corpora/langid), or I'm happy to distribute them to you 
 directly under the GNU GPL v3. The Python code has been optimized for 
 performance. Maps of up to 7-grams run amazingly fast.
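
 To illustrate how n-gram language maps work in general, here is a 
 minimal sketch. It is not langmatch's code: the names build_map, 
 distance and guess are made up for this example, and the rank-based 
 "out-of-place" scoring is the classic textcat-style approach, which I 
 am only assuming is comparable to what langmatch does internally.

```python
from collections import Counter

def build_map(text, n=3, top=300):
    """Map each of the `top` most frequent character n-grams to its rank."""
    text = " ".join(text.lower().split())  # normalize case and whitespace
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return {g: rank for rank, (g, _) in enumerate(grams.most_common(top))}

def distance(sample_map, lang_map, penalty):
    """Out-of-place distance: sum of rank differences, with a fixed
    penalty for every sample n-gram missing from the language map."""
    return sum(
        abs(rank - lang_map[g]) if g in lang_map else penalty
        for g, rank in sample_map.items()
    )

def guess(text, lang_maps, n=3, top=300):
    """Return (best_language, raw_scores); the lowest distance wins."""
    sample = build_map(text, n, top)
    scores = {lang: distance(sample, m, top) for lang, m in lang_maps.items()}
    return min(scores, key=scores.get), scores

# Toy training texts (hypothetical; real maps come from large corpora).
training = {
    "en": "the quick brown fox jumps over the lazy dog and the cat "
          "sat on the mat with the other animals",
    "de": "der schnelle braune fuchs springt ueber den faulen hund "
          "und die katze sitzt auf der matte",
}
lang_maps = {lang: build_map(text) for lang, text in training.items()}

best, scores = guess("the dog and the cat", lang_maps)
print(best, scores)
```

 The raw scores are distances, so lower is better. Longer input gives 
 the sample map more n-grams to compare, which is why longer strings are 
 guessed more reliably.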

 Langmatch seems infinitely configurable. You can run langmatch from the 
 command line or import it directly into your own program. It can process 
 input line by line or as whole documents, you name it. It can report its 
 raw scores, or just the voted result. You can pick which maps to use for 
 analysis without moving the maps in and out of the installation folder. 
 Also, the author informally brands his file-opening code 
 "lib-open-my-god-damn-file.py". You get:
   * stdin/stdout
   * regular paths and filenames on Windows/Posix
   * URLs: read files directly via http, ftp, ftps, etc.
   * custom URI types
   * transparent decompression on local files
   * iteration of directories for input
   * substitution of default filenames for output
   * much, much more
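
 A few of the items above (stdin, plain paths, URLs and transparent 
 decompression) can be sketched with the standard library alone. This is 
 only an illustration under my own assumptions, not the author's 
 file-opening module: open_text is a hypothetical helper, "-" meaning 
 stdin is a convention I picked, and compression is detected from the 
 file extension only.

```python
import bz2
import gzip
import io
import sys
from urllib.request import urlopen

def open_text(source, encoding="utf-8"):
    """Return a readable text stream for "-", a URL, or a local path."""
    if source == "-":
        return sys.stdin                      # read from standard input
    if "://" in source:
        # Fetch the whole resource, then expose it as a text stream.
        with urlopen(source) as response:
            data = response.read()
        return io.StringIO(data.decode(encoding))
    if source.endswith(".gz"):
        return gzip.open(source, "rt", encoding=encoding)
    if source.endswith(".bz2"):
        return bz2.open(source, "rt", encoding=encoding)
    return open(source, "r", encoding=encoding)
```

 Calling code can then iterate over open_text(name) in the same way 
 whether the name is a plain file, a compressed file or a URL.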

 Langmatch has been tested on Linux with Python 2.6, 2.7 and 3, and on 
 MS Windows with Python 2.7 (it should also work with 2.6 and 3 on 
 Windows and Mac). It is fully Unicode-aware. It is distributed under the 
 GNU GPL v3 license.

 I hope some of you on this list will enjoy this tool.

 Tom
_______________________________________________
Moses-support mailing list
http://mailman.mit.edu/mailman/listinfo/moses-support
