Hello Byron, I know you can enable language detection during index-more; however, is there a method for doing this during the crawl?
OG: I don't recall the name of the class/method (UrlFilter?), nor do I recall whether that is an extension point for a plugin, but wherever the point in the fetch execution is where you can get hold of the page content (text), you can potentially use the language identification framework to guess the language of the page.

> I'm interested in building an English-only index right now. What is the theory behind that? Does anyone have any experience?

The theory is that you need to train the language identification library to recognize English, then throw sufficiently long chunks of text at it and let it guess the language. I believe Jerome has the training files somewhere (and a page about all this on the Wiki).

> Would that mean building a huge blacklist and ignoring TLDs until you find a computational method, or...? Thoughts, anyone?

If you really want to distinguish by language, then you need to use language recognition, not TLD filtering. A .hr (Croatia) site/page could easily be all in English, for instance, and a .com site may easily be in Chinese.

Otis
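To illustrate the "train on English, then feed it sufficiently long chunks of text" idea, here is a minimal, self-contained sketch of an n-gram-style guesser. This is not the Nutch language-identifier plugin; the class and method names are made up for illustration, and a real profile would rank n-grams by frequency rather than use a simple set of trigrams.

import java.util.HashSet;
import java.util.Set;

// Toy sketch of n-gram-based language identification.
// Not the Nutch language-identifier plugin; names are invented.
public class NGramLanguageGuesser {

    private final Set<String> englishProfile;

    // "Training": collect the character trigrams seen in a sample of
    // known-English text.
    public NGramLanguageGuesser(String englishTrainingText) {
        this.englishProfile = trigrams(englishTrainingText);
    }

    // Guess: the longer the chunk of text, the more reliable the score,
    // which is why long extracts are fed to the identifier.
    public boolean looksEnglish(String text, double threshold) {
        Set<String> seen = trigrams(text);
        if (seen.isEmpty()) {
            return false;
        }
        int hits = 0;
        for (String gram : seen) {
            if (englishProfile.contains(gram)) {
                hits++;
            }
        }
        return (double) hits / seen.size() >= threshold;
    }

    private static Set<String> trigrams(String text) {
        Set<String> grams = new HashSet<String>();
        String normalized = text.toLowerCase().replaceAll("[^a-z ]", " ");
        for (int i = 0; i + 3 <= normalized.length(); i++) {
            grams.add(normalized.substring(i, i + 3));
        }
        return grams;
    }

    public static void main(String[] args) {
        NGramLanguageGuesser guesser = new NGramLanguageGuesser(
            "the quick brown fox jumps over the lazy dog "
            + "this is a sample of english text used to build a toy profile");
        System.out.println(guesser.looksEnglish("the dog jumps over the fox", 0.3));
    }
}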

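And a rough sketch of how such a guesser could be wired in at the point in the crawl where the page text becomes available, so non-English pages are dropped before they ever reach the index. The hook shown here (a filterParse method taking the extracted text) is hypothetical, not an actual Nutch extension point; the real plugin interface would have to be looked up on the Wiki.

// Hypothetical parse-time filter: drop pages whose extracted text does not
// look English. "EnglishOnlyFilter" and "filterParse" are invented names,
// not part of the Nutch API.
public class EnglishOnlyFilter {

    private final NGramLanguageGuesser guesser;

    public EnglishOnlyFilter(NGramLanguageGuesser guesser) {
        this.guesser = guesser;
    }

    // Returns the text unchanged if it looks English, or null to signal
    // that whatever calls this filter should skip the page.
    public String filterParse(String url, String extractedText) {
        // Very short extracts are unreliable, so let them through
        // (you could choose to drop them instead).
        if (extractedText == null || extractedText.length() < 200) {
            return extractedText;
        }
        return guesser.looksEnglish(extractedText, 0.3) ? extractedText : null;
    }
}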