Hello Byron, I know you can enable language detection during index-more; however, is there a method for doing this during the crawl?
OG: I don't recall the name of the class/method (UrlFilter?), nor do I recall whether that is an extension point for a plugin, but wherever the point in the fetch execution is where you can get hold of the page content (text), you can potentially use the language identification framework to guess the language of the page.

> I'm interested in building an English-only index right now. What is the theory behind that? Does anyone have any experience?

The theory is that you need to train the language identification library to recognize English, then throw sufficiently long chunks of text at it and let it guess the language. I believe Jerome has the training files somewhere (and a page about all this on the Wiki).

> Would that mean building a huge blacklist and ignoring TLDs until you find a computational method, or...? Thoughts, anyone?

If you really want to distinguish by language, then you need to use language recognition, not TLD filtering. A .hr (Croatia) site/page could easily be all in English, for instance, and a .com site may easily be in Chinese.

Otis
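To illustrate the "train on English, then feed it sufficiently long chunks of text" idea, here is a minimal, self-contained sketch of an n-gram-style guesser. This is not the Nutch language-identifier plugin; the class and method names are made up for illustration, and a real profile would rank n-grams by frequency rather than use a simple set of trigrams.

import java.util.HashSet;
import java.util.Set;

// Toy sketch of n-gram-based language identification.
// Not the Nutch language-identifier plugin; names are invented.
public class NGramLanguageGuesser {

    private final Set<String> englishProfile;

    // "Training": collect the character trigrams seen in a sample of
    // known-English text.
    public NGramLanguageGuesser(String englishTrainingText) {
        this.englishProfile = trigrams(englishTrainingText);
    }

    // Guess: the longer the chunk of text, the more reliable the score,
    // which is why long extracts are fed to the identifier.
    public boolean looksEnglish(String text, double threshold) {
        Set<String> seen = trigrams(text);
        if (seen.isEmpty()) {
            return false;
        }
        int hits = 0;
        for (String gram : seen) {
            if (englishProfile.contains(gram)) {
                hits++;
            }
        }
        return (double) hits / seen.size() >= threshold;
    }

    private static Set<String> trigrams(String text) {
        Set<String> grams = new HashSet<String>();
        String normalized = text.toLowerCase().replaceAll("[^a-z ]", " ");
        for (int i = 0; i + 3 <= normalized.length(); i++) {
            grams.add(normalized.substring(i, i + 3));
        }
        return grams;
    }

    public static void main(String[] args) {
        NGramLanguageGuesser guesser = new NGramLanguageGuesser(
            "the quick brown fox jumps over the lazy dog "
            + "this is a sample of english text used to build a toy profile");
        System.out.println(guesser.looksEnglish("the dog jumps over the fox", 0.3));
    }
}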

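And a rough sketch of how such a guesser could be wired in at the point in the crawl where the page text becomes available, so non-English pages are dropped before they ever reach the index. The hook shown here (a filterParse method taking the extracted text) is hypothetical, not an actual Nutch extension point; the real plugin interface would have to be looked up on the Wiki.

// Hypothetical parse-time filter: drop pages whose extracted text does not
// look English. "EnglishOnlyFilter" and "filterParse" are invented names,
// not part of the Nutch API.
public class EnglishOnlyFilter {

    private final NGramLanguageGuesser guesser;

    public EnglishOnlyFilter(NGramLanguageGuesser guesser) {
        this.guesser = guesser;
    }

    // Returns the text unchanged if it looks English, or null to signal
    // that whatever calls this filter should skip the page.
    public String filterParse(String url, String extractedText) {
        // Very short extracts are unreliable, so let them through
        // (you could choose to drop them instead).
        if (extractedText == null || extractedText.length() < 200) {
            return extractedText;
        }
        return guesser.looksEnglish(extractedText, 0.3) ? extractedText : null;
    }
}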