Is it possible to use the webdb (link db) to add rules
that are applied as you recursively process your data,
based on link analysis?
For example, adding a field for language so you can either
score based on language (using language as your
identifying "top" to build your link analysis from) or
simply ignore a language/TLD or some other field
altogether in the generate process (essentially not
refetching, or pragmatically refetching, or scaling your
fetch by some priority)?
1. As others have noted, you want to run analysis code on the fetched
page contents to identify (guess at) the language. This kind of code
typically uses statistical models to generate probabilities for a set
of languages that it's been trained on.
There was a past thread about this on the Nutch developer list.
2. We modified Nutch to do something similar, at the point where
outlinks are harvested. We compute a custom score there and stuff it
into the page's nextScore field, since that field isn't used by anything
other than the (too slow) link analysis tool. Note that this was with
the 0.7 code base.
3. Then in FetchListTool we use nextScore to order links for
fetching. So for you, if the language of the page was used to
calculate this score, then you could focus your crawl on
English-content pages.
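
To make the three steps above concrete, here is a rough, standalone Java sketch. It is not Nutch code and does not use the Nutch API: the class LanguageFocusedFetchList, the Link record, and the methods englishScore, harvestOutlinks, and buildFetchList are all hypothetical names made up for illustration. The stopword ratio is only a toy stand-in for a real trained statistical language model, and the only name taken from the description above is nextScore.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class LanguageFocusedFetchList {

    // Toy stand-in for a trained model: a few very common English words.
    private static final Set<String> ENGLISH_STOPWORDS = new HashSet<>(Arrays.asList(
        "the", "and", "of", "to", "in", "is", "that", "for", "with", "this"));

    /** Hypothetical link record; nextScore mirrors the field name used above. */
    static final class Link {
        final String url;
        final float nextScore;

        Link(String url, float nextScore) {
            this.url = url;
            this.nextScore = nextScore;
        }
    }

    /** Step 1 (assumption): fraction of tokens that are English stopwords, in [0, 1]. */
    static float englishScore(String pageText) {
        String[] tokens = pageText.toLowerCase(Locale.ROOT).split("[^\\p{L}]+");
        int total = 0;
        int hits = 0;
        for (String token : tokens) {
            if (token.isEmpty()) {
                continue; // split() can yield an empty leading token
            }
            total++;
            if (ENGLISH_STOPWORDS.contains(token)) {
                hits++;
            }
        }
        return total == 0 ? 0f : (float) hits / total;
    }

    /** Step 2: each harvested outlink carries the score of the page it was found on. */
    static List<Link> harvestOutlinks(String pageText, List<String> outlinks) {
        float score = englishScore(pageText);
        List<Link> links = new ArrayList<>();
        for (String url : outlinks) {
            links.add(new Link(url, score));
        }
        return links;
    }

    /** Step 3: order candidates by descending nextScore and keep at most 'limit' of them. */
    static List<Link> buildFetchList(List<Link> candidates, int limit) {
        List<Link> sorted = new ArrayList<>(candidates);
        sorted.sort(Comparator.comparingDouble((Link l) -> l.nextScore).reversed());
        return new ArrayList<>(sorted.subList(0, Math.min(limit, sorted.size())));
    }

    public static void main(String[] args) {
        List<Link> candidates = new ArrayList<>();
        candidates.addAll(harvestOutlinks(
            "Der Crawler konzentriert sich auf die wichtigen Seiten",
            Arrays.asList("http://example.de/a")));
        candidates.addAll(harvestOutlinks(
            "The crawl is focused on the pages that matter to this project",
            Arrays.asList("http://example.com/b")));

        // Links found on the English page sort ahead of the others.
        for (Link link : buildFetchList(candidates, 10)) {
            System.out.println(link.nextScore + " " + link.url);
        }
    }
}
```

Scoring the outlink by the language of the page it was found on is a guess about the target page, not a measurement of it, so a real crawl would probably want to re-score once the target itself has been fetched.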
-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-470-9200