I like the idea, and it is another step in the direction of vertical search, which is where I personally see the biggest opportunity for Nutch.

How would we implement it? Surprisingly, I think it's very simple: adding a CrawlDatum.policyId field would suffice, assuming we have a means to store and retrieve these policies by ID. We would then instantiate the policy and call the appropriate methods wherever we use the URLFilters and do the score calculations today.
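To make the idea concrete, here is a minimal sketch of what "look up the policy by the id stored in the datum" could look like. All names (CrawlPolicy, Datum, POLICIES) are illustrative stand-ins, not actual Nutch classes:

```java
import java.util.HashMap;
import java.util.Map;

public class PolicyIdSketch {

    // Stand-in for a crawl policy: here just a URL filter plus a score factor.
    interface CrawlPolicy {
        boolean accept(String url);
        float scoreFactor();
    }

    // Stand-in for CrawlDatum carrying the proposed policyId field.
    static class Datum {
        final String url;
        final int policyId;
        Datum(String url, int policyId) { this.url = url; this.policyId = policyId; }
    }

    // The "means to store and retrieve these policies by ID".
    static final Map<Integer, CrawlPolicy> POLICIES = new HashMap<>();

    public static void main(String[] args) {
        POLICIES.put(1, new CrawlPolicy() {
            public boolean accept(String url) { return url.endsWith(".html"); }
            public float scoreFactor() { return 2.0f; }
        });

        Datum d = new Datum("http://example.com/page.html", 1);
        CrawlPolicy p = POLICIES.get(d.policyId);

        // Where the crawler today calls URLFilters and scoring code globally,
        // it would instead consult the policy attached to this datum.
        System.out.println(p.accept(d.url));   // true
        System.out.println(p.scoreFactor());   // 2.0
    }
}
```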

Before we keep adding metadata field by field, why not add generic metadata support to the CrawlDatum once and for all? Then we could have any kind of plugin that adds and processes metadata belonging to a URL.
Besides policyId, I see many more candidates for crawl metadata:
last crawl date, URL category, collection key (similar to Google Search Appliance collections), etc.
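A generic metadata map on the datum would cover all of these candidates with one mechanism. The following is a hypothetical sketch (the class and key names are mine, not Nutch API) of a CrawlDatum-like record that lets any plugin attach and read back its own per-URL entries:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical CrawlDatum-like record with a generic metadata map,
// so plugins can attach arbitrary per-URL data (policy id, last crawl
// date, category, collection key, ...) without a new field for each one.
public class MetaDatum {
    private final Map<String, String> metaData = new HashMap<>();

    public void setMeta(String key, String value) { metaData.put(key, value); }
    public String getMeta(String key) { return metaData.get(key); }

    public static void main(String[] args) {
        MetaDatum datum = new MetaDatum();
        // Any plugin could add and later process its own keys:
        datum.setMeta("policyId", "42");
        datum.setMeta("collection", "intranet-docs");
        datum.setMeta("lastCrawl", "2006-01-15");

        System.out.println(datum.getMeta("collection")); // intranet-docs
    }
}
```

Since the metadata travels with the datum through the crawl cycle, it would already be at hand at indexing time instead of being fetched from an external store.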

All solutions I have seen so far load this kind of metadata at indexing time from a third-party data source (a database) and add it to the index. This works, but it is very slow.

Stefan



_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
