> > > I think the most time-consuming part of the language identifier is
> > > splitting the text into n-grams, and probably the biggest
> > > optimization could be done there.


That step is not easy to optimize!
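For reference, the splitting step itself usually looks something like this
(just a sketch in Java, assuming character n-grams of length 2 to 5; the
class and method names are made up, not the identifier's actual code):

import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: split a lowercased string into character
// n-grams of length 2 to 5, the step discussed above. The class and
// method names are made up, not the identifier's actual code.
public class NGramSplitSketch {

    static List<String> split(String text, int minN, int maxN) {
        String s = text.toLowerCase();
        List<String> grams = new ArrayList<>();
        for (int n = minN; n <= maxN; n++) {
            for (int i = 0; i + n <= s.length(); i++) {
                grams.add(s.substring(i, i + n));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(split("language", 2, 5));
    }
}

Most of the cost is in creating all those small substrings, which is hard
to avoid without changing how the profiles are represented.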

> > I remember reading somewhere about n-gram language detection
> > that taking the first 512 characters of the text is usually
> > sufficient, but I can't recall where I read it... That
> > process used n-gram profiles built from n-grams of length 2 to 5,
> > and each profile was limited to the 300 most frequent n-grams.
> Languid (http://languid.cantbedone.org/) requires "at least 20
> characters of UTF-8 encoded text". I haven't read the code for it, but I
> presume it uses n-grams.


Yes. But how can we be sure that the first 20 or 512 characters of a document
are in the same language as the whole document?
I think the language identifier must process the whole document to reliably
identify its main language.
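
To make the trade-off concrete, here is a rough sketch of building the kind
of profile described above (2- to 5-grams, limited to the 300 most frequent),
from either the first 512 characters or the full text. Names are hypothetical,
not the project's real API:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Rough sketch: build a frequency-ranked profile of 2-5 character n-grams,
// keeping only the 300 most frequent, from an optionally truncated text.
// Names are hypothetical; this is not the identifier's real code.
public class ProfileSketch {

    static List<String> topNGrams(String text, int limit, int maxChars) {
        String s = text.length() > maxChars ? text.substring(0, maxChars) : text;
        s = s.toLowerCase();
        Map<String, Integer> counts = new HashMap<>();
        for (int n = 2; n <= 5; n++) {
            for (int i = 0; i + n <= s.length(); i++) {
                counts.merge(s.substring(i, i + n), 1, Integer::sum);
            }
        }
        List<Map.Entry<String, Integer>> ranked = new ArrayList<>(counts.entrySet());
        ranked.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(limit, ranked.size()); i++) {
            top.add(ranked.get(i).getKey());
        }
        return top;
    }

    public static void main(String[] args) {
        String doc = "...";  // some long, possibly multi-language document
        // Profile from the first 512 characters vs. from the whole document:
        List<String> prefixProfile = topNGrams(doc, 300, 512);
        List<String> fullProfile = topNGrams(doc, 300, Integer.MAX_VALUE);
        System.out.println(prefixProfile.size() + " vs " + fullProfile.size());
    }
}

If the prefix profile and the full-document profile disagree noticeably, the
first 512 characters were probably not representative of the whole document.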

Jerome


-- 
http://motrech.free.fr/
http://frutch.free.fr/
