You may also be able to extract some useful information from the character encoding, which is available in the charset parameter of the Content-Type header (see http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.17).
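For illustration only, here is a rough Python sketch of pulling the charset out of a Content-Type value and turning it into a language hint. The function names and the small mapping table are my own assumptions, not part of Nutch or of the patch, and the table only lists a few encodings that are tied to a single language.

```python
from email.message import Message

# Hypothetical, deliberately tiny table: only encodings that are tied to
# one language are listed. Keys are lower-cased with "-" replaced by "_".
CHARSET_LANGUAGE_HINTS = {
    "shift_jis": "ja",
    "x_sjis": "ja",
    "euc_jp": "ja",
    "iso_2022_jp": "ja",
    "euc_kr": "ko",
    "iso_8859_7": "el",
}


def charset_from_content_type(header_value):
    """Return the normalised charset parameter of a Content-Type value, or None."""
    if not header_value:
        return None
    msg = Message()
    msg["Content-Type"] = header_value
    charset = msg.get_param("charset")
    if not isinstance(charset, str):
        # Missing parameter (None) or an RFC 2231 encoded tuple: give up.
        return None
    return charset.strip().lower().replace("-", "_")


def language_hint(header_value):
    """Return a language code suggested by the charset, or None if there is no hint."""
    return CHARSET_LANGUAGE_HINTS.get(charset_from_content_type(header_value))
```

A charset such as UTF-8 or ISO-8859-1 gives no hint at all, so a caller would treat None as "keep looking" rather than as a failure.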
Obviously this won't always be useful, but encodings like Shift-JIS are pretty good indicators of the language (Japanese in that case).

Nick

> -----Original Message-----
> From: Sami Siren [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, 30 June 2004 6:47 AM
> To: [EMAIL PROTECTED]
> Subject: [Nutch-dev] language-identifier
> Importance: Low
>
>
> I just uploaded a patch that adds a language identifier
> plugin to nutch.
>
> http://sourceforge.net/tracker/index.php?func=detail&aid=982263&group_id=59548&atid=491356
>
> The process of identification is as follows:
>
> 1. html (html only, HTML 4.0 "lang" attribute)
> 2. meta tags (html only, http-equiv, dc.language)
> 3. http header (Content-Language)
> 4. if all above fail "statistical analysis"
>
> 1 & 2 are run during the fetching phase and 3 & 4 are run on
> indexing phase.
>
> Currently supported languages (in "statistical analysis") are
> da,de,el,en,es,fi,fr,it,nl,sv and pt. The corpus used was grabbed from
> http://www.isi.edu/~koehn/europarl/ and the profiles were built with
> tool supplied in patch.
>
> After indexing the language can be found from field named "lang"
>
> it's not 100% accurate but it's a start.
>
> --
> Sami Siren
>
>
> -------------------------------------------------------
> This SF.Net email sponsored by Black Hat Briefings & Training.
> Attend Black Hat Briefings & Training, Las Vegas July 24-29 -
> digital self defense, top technical experts, no vendor pitches,
> unmatched networking opportunities. Visit www.blackhat.com
> _______________________________________________
> Nutch-developers mailing list
> [EMAIL PROTECTED]
> https://lists.sourceforge.net/lists/listinfo/nutch-developers
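
The identification order in Sami's message (lang attribute, then meta tags, then Content-Language, then statistical analysis) could be sketched roughly as below. This is not the plugin's code: the function name, its parameters and the `statistical_guess` callback are hypothetical, and the sketch collapses the fetching-phase and indexing-phase steps into one call for brevity.

```python
def identify_language(html_lang=None, meta_lang=None, content_language=None,
                      text="", statistical_guess=None):
    """Return (language, source) from the first hint that is present.

    statistical_guess is a hypothetical callable standing in for the
    profile-based analysis; it takes the page text and returns a code.
    """
    hints = (
        ("html lang attribute", html_lang),
        ("meta tag", meta_lang),
        ("Content-Language header", content_language),
    )
    for source, value in hints:
        if value and value.strip():
            # Content-Language may list several languages; take the first,
            # then keep only the primary subtag, e.g. "en-US" -> "en".
            first = value.split(",")[0].strip().lower()
            return first.split("-")[0], source
    # Every declared hint was missing, so fall back to statistical analysis.
    if statistical_guess is not None and text:
        return statistical_guess(text), "statistical analysis"
    return None, None
```

Returning the source alongside the language makes it easy to see which step produced the answer.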
