> ftp://ftp.software.ibm.com/software/globalization/documents/linguini.pdf
>
> Linguini: Language Identification for Multilingual Documents
> John M. Prager
>
> Prager also uses an n-gram approach, so you might be able to take
> advantage of some of his research into optimal values for <n>.

Yeah, though to be honest, as long as you're on the long-tail portion of N
I don't think the values will matter much. All you'll do is waste a bit of
memory (like 1k).

> The code to Linguini doesn't seem to be available (you have to
> purchase some IBM product(s) to get it) so what you've done is great
> for the open source community - thanks!
>
> Also I could post to the Unicode list re training data in multiple
> languages, as that's a good place to find out about multilingual
> corpora.

Yeah. That was my biggest problem. This area had never really been solved
in the OSS world.

--
Kevin A. Burton, Location - San Francisco, CA
AIM/YIM - sfburtonator, Web - http://www.feedblog.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412