I just uploaded a patch that adds a language identifier plugin to Nutch.
http://sourceforge.net/tracker/index.php?func=detail&aid=982263&group_id=59548&atid=491356
The identification process is as follows:
1. html (HTML only, HTML 4.0 "lang" attribute)
2. meta tags (HTML only, http-equiv, dc.language)
3. HTTP header (Content-Language)
4. if all of the above fail, "statistical analysis"
Steps 1 & 2 are run during the fetching phase, and steps 3 & 4 are run during the indexing phase.
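
The fallback order above can be sketched roughly as follows. This is only an illustration, not code from the patch: the class name LanguageCascade, the method names, the reduction of tags like "en-US" to "en", and the "unknown" fallback value are all assumptions made for the example. It also collapses the fetching-phase and indexing-phase steps into one method for brevity.

```java
import java.util.Locale;
import java.util.Optional;
import java.util.function.Supplier;

/** Hypothetical illustration of the fallback order; not the plugin's actual classes. */
class LanguageCascade {

    /** Treats null/blank values as "not found"; keeping only the primary subtag is an assumption. */
    static Optional<String> normalize(String raw) {
        if (raw == null) return Optional.empty();
        String s = raw.trim().toLowerCase(Locale.ROOT);
        if (s.isEmpty()) return Optional.empty();
        int cut = s.indexOf('-');
        return Optional.of(cut > 0 ? s.substring(0, cut) : s);
    }

    /**
     * Tries the html "lang" attribute, then the meta tags, then the
     * Content-Language header, and only then the (lazily evaluated)
     * statistical guess.
     */
    static String identify(String htmlLang, String metaLang, String headerLang,
                           Supplier<String> statistical) {
        for (String candidate : new String[] {htmlLang, metaLang, headerLang}) {
            Optional<String> lang = normalize(candidate);
            if (lang.isPresent()) return lang.get();
        }
        // "unknown" is a placeholder chosen for this sketch.
        return normalize(statistical.get()).orElse("unknown");
    }

    public static void main(String[] args) {
        System.out.println(identify(null, "fi", "en", () -> "sv")); // prints "fi"
        System.out.println(identify(null, " ", null, () -> "sv"));  // prints "sv"
    }
}
```

The statistical step is passed as a Supplier so that it only runs when the three cheaper sources yield nothing.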
Currently supported languages (in "statistical analysis") are da, de, el, en, es, fi, fr, it, nl, sv and pt. The corpus used was grabbed from http://www.isi.edu/~koehn/europarl/ and the profiles were built with the tool supplied in the patch.
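
To give a rough idea of what profile-based "statistical analysis" can look like, here is a minimal sketch. Everything in it is an assumption for illustration: the use of character trigrams, the cosine-similarity scoring, the class name TrigramGuesser and the tiny inline sample texts. The patch's own profile format and scoring may well differ.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch of profile-based guessing; not the plugin's algorithm. */
class TrigramGuesser {

    /** Counts character trigrams after lowercasing and collapsing non-letters to spaces. */
    static Map<String, Integer> profile(String text) {
        Map<String, Integer> counts = new HashMap<>();
        String s = " " + text.toLowerCase(Locale.ROOT).replaceAll("[^\\p{L}]+", " ").trim() + " ";
        for (int i = 0; i + 3 <= s.length(); i++) {
            counts.merge(s.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    /** Cosine similarity between two trigram count maps (0.0 if either is empty). */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            na += e.getValue() * e.getValue();
            Integer other = b.get(e.getKey());
            if (other != null) dot += e.getValue() * other;
        }
        for (int v : b.values()) nb += v * v;
        return (na == 0 || nb == 0) ? 0.0 : dot / Math.sqrt(na * nb);
    }

    /** Returns the key of the profile most similar to the text, or null if there are no profiles. */
    static String guess(String text, Map<String, Map<String, Integer>> profiles) {
        Map<String, Integer> p = profile(text);
        String best = null;
        double bestScore = -1;
        for (Map.Entry<String, Map<String, Integer>> e : profiles.entrySet()) {
            double score = cosine(p, e.getValue());
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy profiles; a real profile would be built from a large corpus.
        Map<String, Map<String, Integer>> profiles = new HashMap<>();
        profiles.put("en", profile("the quick brown fox jumps over the lazy dog and the cat"));
        profiles.put("fi", profile("nopea ruskea kettu ja laiska koira seka kissa"));
        System.out.println(guess("the dog and the fox", profiles)); // prints "en"
    }
}
```

With profiles this small the guess is only meaningful for text that shares trigrams with the samples; larger corpora are what make the approach usable in practice.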
After indexing, the language can be found in the field named "lang".
It's not 100% accurate, but it's a start.
-- Sami Siren
_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
