> > I think the most time-consuming part of the language identifier is
> > splitting the text into n-grams, and probably the biggest optimization
> > could be done there.
> >
> > Perhaps a configurable variable to set the maximum text length to be
> > analyzed. A minimum limit could also be defined, because with a small
> > number of n-grams the performance (as in quality) is very poor.
> >
> > I'll also do some experiments to see how speed could be improved.
>
> I remember reading somewhere about n-gram language detection that
> taking the first 512 characters of the text is usually sufficient,
> but I can't recall where I read it... That process used n-gram
> profiles built from n-grams of length 2 to 5, and each profile was
> limited to the 300 most frequent n-grams.

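For illustration only, the scheme described above (truncate the text, collect 2- to 5-character n-grams, keep the 300 most frequent) could be sketched roughly like this in Python. This is not the Nutch language identifier code; the function name `build_profile` and all of its parameters are hypothetical, and the defaults simply mirror the numbers quoted above.

```python
from collections import Counter


def build_profile(text, max_chars=512, min_chars=0, min_n=2, max_n=5, top_k=300):
    """Return the top_k most frequent character n-grams of the truncated text.

    All names and defaults here are assumptions for illustration.
    """
    sample = text[:max_chars]  # analyze only the first max_chars characters
    if len(sample) < min_chars:
        # too little text to give a meaningful profile
        return []
    counts = Counter()
    for n in range(min_n, max_n + 1):
        # slide a window of width n over the sample
        for i in range(len(sample) - n + 1):
            counts[sample[i:i + n]] += 1
    # most_common() orders n-grams by descending frequency
    return [gram for gram, _ in counts.most_common(top_k)]


profile = build_profile("the quick brown fox jumps over the lazy dog " * 20)
```

Capping the input at `max_chars` bounds the n-gram splitting cost regardless of document size, which is the optimization being discussed; `min_chars` is the optional lower limit suggested in the quoted message.
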
Languid (http://languid.cantbedone.org/) requires "at least 20 characters of UTF-8 encoded text". I haven't read the code for it, but I presume it uses n-grams.

Nick

_______________________________________________
Nutch-developers mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-developers
