[Nutch-general] Re: Limiting search/crawl to specific language

Ken Krugler Wed, 04 Jan 2006 10:05:14 -0800

Is it possible to use the webdb (link db) to add rules
as you recursively process your data based upon
link analysis?


Such as adding a field for language so you can either
score based on language (using language as your
identifying "top" to build your link analysis from) or
simply ignoring a language/tld or some field
altogether from the generate process (essentially not
refetching or pragmatically refetching or scaling your
fetch on some priority)?

1. As others have noted, you want to run analysis code on the fetched page contents to identify (guess at) the language. This kind of code typically uses statistical models to generate probabilities for a set of languages that it's been trained on.


There was a past thread about this on the Nutch developer list.
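
To make the "statistical models" idea in point 1 concrete, here is a minimal sketch of a character-trigram language guesser. It is only an illustration, not the code Nutch or any language-identification plugin uses: the function names (trigrams, train, guess_language), the add-one smoothing, and the tiny training texts in the usage note are all assumptions made for the example. A real model would be trained on far more text per language.

```python
import math
from collections import Counter


def trigrams(text):
    """Split text into overlapping, lower-cased character trigrams."""
    padded = f"  {text.lower()} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def train(samples):
    """Build one trigram-count model per language.

    samples maps a language code to a training text (hypothetical input).
    """
    return {lang: Counter(trigrams(text)) for lang, text in samples.items()}


def guess_language(text, models):
    """Return a probability for each trained language, summing to 1."""
    vocab = set()
    for counts in models.values():
        vocab.update(counts)
    vocab_size = len(vocab) + 1  # one extra slot for unseen trigrams

    log_scores = {}
    for lang, counts in models.items():
        total = sum(counts.values())
        # Add-one smoothing so an unseen trigram never yields log(0).
        log_scores[lang] = sum(
            math.log((counts[gram] + 1) / (total + vocab_size))
            for gram in trigrams(text)
        )

    # Turn log-likelihoods into normalised probabilities.
    peak = max(log_scores.values())
    weights = {lang: math.exp(s - peak) for lang, s in log_scores.items()}
    norm = sum(weights.values())
    return {lang: w / norm for lang, w in weights.items()}
```

Usage would look something like this: train on one text per language, then call guess_language(page_text, models) on the fetched page's extracted text and take the probability for the language you care about.

```python
models = train({
    "en": "the quick brown fox jumps over the lazy dog and the cat sat on the mat",
    "de": "der schnelle braune fuchs springt über den faulen hund und die katze sitzt auf der matte",
})
probabilities = guess_language("the dog and the cat", models)
```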

2. We modified Nutch to do something similar, at the point where outlinks are harvested. We stuff this special score into the page's nextScore field, since this isn't used other than for the (too slow) link analysis tool. Note that this was with the 0.7 code base.

3. Then in FetchListTool we use nextScore to order links for fetching. So for you, if the language of the page was used to calculate this score, then you could focus your crawl on English-content pages.
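
Points 2 and 3 together amount to "score outlinks when they are harvested, then fetch the highest-scoring links first". The sketch below shows that idea in isolation. It is not the Nutch 0.7 code or the FetchListTool API: the Link class, score_outlinks, and build_fetch_list are hypothetical names, and next_score is just a stand-in for the nextScore field. It assumes each outlink simply inherits the English probability of the page it was found on.

```python
import heapq
from dataclasses import dataclass


@dataclass
class Link:
    url: str
    next_score: float  # hypothetical stand-in for the page's nextScore field


def score_outlinks(outlink_urls, english_probability):
    """Give every outlink the language score of the page it came from."""
    return [Link(url, english_probability) for url in outlink_urls]


def build_fetch_list(links, limit):
    """Pick the `limit` highest-scoring links, best first."""
    return heapq.nlargest(limit, links, key=lambda link: link.next_score)
```

For example, outlinks harvested from a page judged 0.9 likely to be English would be scheduled ahead of outlinks from a page judged 0.1 likely, so a limited fetch list drifts toward English-content pages.

```python
links = (score_outlinks(["http://example.com/a", "http://example.com/b"], 0.9)
         + score_outlinks(["http://example.org/x"], 0.1))
fetch_list = build_fetch_list(links, limit=2)
```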


-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-470-9200

