> > 
> > I think the most time-consuming part of the language identifier is 
> > splitting the text into n-grams, and probably the biggest 
> > optimization could be done there.
> > 
> > Perhaps a configurable variable could set the maximum text length to be 
> > analyzed. A minimum limit could also be defined, because with a small 
> > number of n-grams the performance (as in quality) is very poor.
> > 
> > I'll also do some experiments to see how speed could be improved.
> 
> I remember reading somewhere about n-gram language detection 
> that taking the first 512 characters of the text is usually 
> sufficient, but I can't recall where I read it... That 
> process used n-gram profiles built from n-grams of 2 to 5 
> characters, and each profile was limited to the 300 most 
> frequent n-grams.
> 
> 

Languid (http://languid.cantbedone.org/) requires "at least 20
characters of UTF-8 encoded text". I haven't read the code for it but I
presume it uses n-grams.
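
To make the limits discussed above concrete, here is a rough Java sketch
of a length-capped profile builder. It is not taken from the Nutch
language identifier or from Languid; the class name, method name and
constants are all hypothetical. The values 512, 2 to 5 and 300 come from
the quoted recollection, and the minimum of 20 characters is borrowed
from the Languid requirement purely as an illustrative default.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: names and defaults are assumptions, not existing API.
class NGramProfileSketch {
    static final int MIN_CHARS = 20;     // below this, return no profile at all
    static final int MAX_CHARS = 512;    // only the first 512 chars are analyzed
    static final int MIN_N = 2;          // shortest n-gram length
    static final int MAX_N = 5;          // longest n-gram length
    static final int PROFILE_SIZE = 300; // keep only the most frequent n-grams

    // Returns n-grams ordered by descending frequency (ties broken
    // alphabetically), or an empty list if the text is too short.
    static List<String> buildProfile(String text) {
        if (text == null || text.length() < MIN_CHARS) {
            return Collections.emptyList();
        }
        // Cap the analyzed text so the cost does not grow with document size.
        // Note: this counts Java chars (UTF-16 units), not code points.
        String sample = text.length() > MAX_CHARS
                ? text.substring(0, MAX_CHARS) : text;

        Map<String, Integer> counts = new HashMap<>();
        for (int n = MIN_N; n <= MAX_N; n++) {
            for (int i = 0; i + n <= sample.length(); i++) {
                counts.merge(sample.substring(i, i + n), 1, Integer::sum);
            }
        }

        List<Map.Entry<String, Integer>> entries =
                new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });

        List<String> profile = new ArrayList<>();
        for (int i = 0; i < entries.size() && i < PROFILE_SIZE; i++) {
            profile.add(entries.get(i).getKey());
        }
        return profile;
    }

    public static void main(String[] args) {
        List<String> profile =
                buildProfile("the quick brown fox jumps over the lazy dog");
        System.out.println(profile.size() + " n-grams in profile");
    }
}
```

With the cap in place, the splitting cost is bounded by roughly
4 * 512 substrings per document regardless of document size, which is
where I would expect most of the saving to come from. Whether 512 and
300 are the right values would need measuring.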

Nick




_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
