You may also be able to extract some useful information from the character
encoding (available in the charset parameter of the Content-Type header - see
http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.17).

Obviously this won't always be useful, but encodings like Shift-JIS are
pretty good indicators of the language (Japanese in that case).
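
As a rough illustration of the idea, here is a minimal Python sketch that
pulls the charset out of a Content-Type value and maps it to a language hint.
The function name and the mapping table are my own assumptions, not anything
in the patch, and the table only lists charsets that are strongly tied to one
language. Anything like UTF-8 or ISO-8859-1 deliberately gives no hint.

```python
from email.message import Message

# Hypothetical table: charsets that strongly suggest a single language.
# Keys are lowercase because get_content_charset() lowercases its result.
CHARSET_LANGUAGE_HINTS = {
    "shift_jis": "ja",
    "shift-jis": "ja",   # common non-standard spelling
    "euc-jp": "ja",
    "iso-2022-jp": "ja",
    "euc-kr": "ko",
    "gb2312": "zh",
    "big5": "zh",
    "koi8-r": "ru",
    "iso-8859-7": "el",
}


def language_hint_from_content_type(content_type):
    """Return a language code suggested by the charset, or None."""
    # Let the standard library parse the header parameters.
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_content_charset()
    if charset is None:
        return None
    return CHARSET_LANGUAGE_HINTS.get(charset)


if __name__ == "__main__":
    print(language_hint_from_content_type("text/html; charset=Shift_JIS"))
    print(language_hint_from_content_type("text/html; charset=UTF-8"))
```

A hint like this would only make sense as one more signal ahead of the
statistical step, since most charsets say nothing about the language.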

Nick

> -----Original Message-----
> From: Sami Siren [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, 30 June 2004 6:47 AM
> To: [EMAIL PROTECTED]
> Subject: [Nutch-dev] language-identifier
> Importance: Low
> 
> 
> I just uploaded a patch that adds a language identifier
> plugin to Nutch.
> 
> http://sourceforge.net/tracker/index.php?func=detail&aid=982263&group_id=59548&atid=491356
> 
> The process of identification is as follows:
> 
> 1. html (html only, HTML 4.0 "lang" attribute)
> 2. meta tags (html only, http-equiv, dc.language)
> 3. http header (Content-Language)
> 4. if all above fail "statistical analysis"
> 
> 1 & 2 are run during the fetching phase and 3 & 4 are run during the
> indexing phase.
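
The order of checks described above can be sketched as a simple fallback
chain. This is only an illustration in Python, not the plugin's actual code:
the function and parameter names are hypothetical, and the statistical step
is passed in as a callable so the sketch stays self-contained.

```python
def identify_language(html_lang=None, meta_lang=None, header_lang=None,
                      text="", statistical_identifier=None):
    """Return the first available language value, in the order listed above.

    html_lang   -- value of the HTML "lang" attribute, if any
    meta_lang   -- value from meta tags (http-equiv, dc.language), if any
    header_lang -- value of the Content-Language HTTP header, if any
    statistical_identifier -- hypothetical callable(text) -> language code
    """
    # Steps 1-3: trust explicit declarations, most specific source first.
    for candidate in (html_lang, meta_lang, header_lang):
        if candidate and candidate.strip():
            return candidate.strip().lower()
    # Step 4: fall back to statistical analysis of the text itself.
    if statistical_identifier is not None:
        return statistical_identifier(text)
    return None


if __name__ == "__main__":
    print(identify_language(meta_lang="fi"))
    print(identify_language(text="hello", statistical_identifier=lambda t: "en"))
```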
> 
> Currently supported languages (in "statistical analysis") are
> da, de, el, en, es, fi, fr, it, nl, sv and pt. The corpus used was grabbed
> from http://www.isi.edu/~koehn/europarl/ and the profiles were built with
> the tool supplied in the patch.
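
The message does not say which statistical method the profiles use, so the
following is only an assumption: a common approach is to build character
n-gram profiles per language and pick the profile closest to the text. A
minimal Python sketch, with toy training strings standing in for a real
corpus and all names made up for illustration:

```python
import math
from collections import Counter


def ngram_profile(text, n=3):
    """Count character n-grams in lowercased, whitespace-normalized text."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine_similarity(a, b):
    """Cosine similarity between two n-gram count profiles."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def identify(text, profiles):
    """Return the language whose profile is most similar to the text."""
    sample = ngram_profile(text)
    return max(profiles, key=lambda lang: cosine_similarity(sample, profiles[lang]))


# Toy profiles; a real profile would be built from a large corpus.
PROFILES = {
    "en": ngram_profile("the quick brown fox jumps over the lazy dog and the cat"),
    "de": ngram_profile("der schnelle braune fuchs springt über den faulen hund und die katze"),
}

if __name__ == "__main__":
    print(identify("the dog and the fox", PROFILES))
    print(identify("der hund und die katze", PROFILES))
```

With profiles this small the result is fragile; accuracy depends heavily on
the size of the training corpus and the length of the text being classified.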
> 
> After indexing, the language can be found in the field named "lang".
> 
> It's not 100% accurate, but it's a start.
> 
> --
>  Sami Siren
> 
> 
> -------------------------------------------------------
> This SF.Net email sponsored by Black Hat Briefings & Training.
> Attend Black Hat Briefings & Training, Las Vegas July 24-29 - 
> digital self defense, top technical experts, no vendor pitches, 
> unmatched networking opportunities. Visit www.blackhat.com
> _______________________________________________
> Nutch-developers mailing list
> [EMAIL PROTECTED]
> https://lists.sourceforge.net/lists/listinfo/nutch-developers
> 

