> > > I think the most time-consuming part of the language identifier is
> > > splitting the text into n-grams, and probably the biggest
> > > optimization could be done there.

Not easy to optimize !!!

> > I remember reading somewhere about n-gram language detection
> > that taking the first 512 characters of the text is usually
> > sufficient, but I can't recall where I read it... That
> > process used n-gram profiles built from 2-5 n-grams, and each
> > profile was limited to the 300 most frequent n-grams.

> Languid (http://languid.cantbedone.org/) requires "at least 20
> characters of UTF-8 encoded text". I haven't read the code for it but I
> presume it uses n-grams.

Yes. But how to be sure that the first 20 or 512 characters of a
document are in the same language as the whole document?
I think the language identifier must process the whole document to
clearly identify its main language.

Jerome

--
http://motrech.free.fr/
http://frutch.free.fr/
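For reference, the profiling step described above (2- to 5-character n-grams, keeping only the ~300 most frequent per profile, optionally computed over just the first 512 characters) is essentially the classic Cavnar & Trenkle scheme. Below is a minimal, self-contained Java sketch of that idea; the class and method names are made up for illustration and are not taken from any existing identifier.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NGramProfile {

    // Build a ranked list of the topN most frequent 2- to 5-character
    // n-grams of the text, optionally truncated to the first maxChars
    // characters (pass maxChars <= 0 to use the whole text).
    static List<String> profile(String text, int maxChars, int topN) {
        if (maxChars > 0 && text.length() > maxChars) {
            text = text.substring(0, maxChars);      // e.g. first 512 characters
        }
        text = text.toLowerCase();

        Map<String, Integer> counts = new HashMap<>();
        for (int n = 2; n <= 5; n++) {               // 2- to 5-grams
            for (int i = 0; i + n <= text.length(); i++) {
                counts.merge(text.substring(i, i + n), 1, Integer::sum);
            }
        }

        List<String> ranked = new ArrayList<>(counts.keySet());
        ranked.sort((a, b) -> counts.get(b) - counts.get(a));  // most frequent first
        return ranked.subList(0, Math.min(topN, ranked.size()));
    }

    // "Out-of-place" distance between two ranked profiles: sum over the
    // document's n-grams of how far each one sits from its rank in the
    // reference profile (n-grams missing from the reference get the
    // maximum penalty).
    static int distance(List<String> doc, List<String> reference) {
        int d = 0;
        for (int i = 0; i < doc.size(); i++) {
            int j = reference.indexOf(doc.get(i));
            d += (j < 0) ? reference.size() : Math.abs(i - j);
        }
        return d;
    }

    public static void main(String[] args) {
        String text = "the quick brown fox jumps over the lazy dog and keeps on running";
        // Profile of the head of the document vs. profile of the whole
        // document: for a monolingual document the two rankings stay
        // close, which is exactly what truncation assumes.
        List<String> head = profile(text, 512, 300);
        List<String> full = profile(text, 0, 300);
        System.out.println("out-of-place distance: " + distance(head, full));
    }
}

Building the profile over both the truncated head and the full text, then comparing the two rankings, is one cheap way to test the concern raised above: if the distance is small, the first 512 characters are representative; if it is large, the document probably mixes languages and needs to be processed as a whole.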
