Andrzej Bialecki wrote:
In the 0.7 branch, generating a segment also modified the WebDB, so that the entries placed in the fetchlist were not immediately available to the next segment generation if it ran before the WebDB was updated with the data from that first segment. This was achieved by adding 1 week to the next fetchTime on a Page.

I can't see that we do this in the trunk. This means that we cannot generate more than one fetchlist between CrawlDB updates, because each fetchlist would be identical to the previous one... Should we worry about this? Modifying the CrawlDB has a cost, but so does being unable to generate multiple different fetchlists and fetch them in parallel...
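
To make the mechanism described above concrete, here is a minimal sketch in Python rather than Nutch's Java. It is not the 0.7 or trunk code: `Page`, `generate`, `next_fetch_ms`, and the `defer` flag are hypothetical names invented for illustration, and the database is just an in-memory dict. It only shows why adding one week to the next fetch time at generate time yields different fetchlists on consecutive runs, and why skipping that step yields identical ones.

```python
from dataclasses import dataclass

WEEK_MS = 7 * 24 * 60 * 60 * 1000  # the one-week offset mentioned in the message


@dataclass
class Page:
    """Hypothetical stand-in for a database entry; not the Nutch Page class."""
    url: str
    next_fetch_ms: int  # earliest time at which this page is due for fetching


def generate(db, now_ms, limit, defer=True):
    """Return up to `limit` due URLs; if `defer` is set, push them out a week.

    `defer` is an assumed option, modelling the "make it optional" suggestion.
    """
    due = [p for p in db.values() if p.next_fetch_ms <= now_ms]
    due.sort(key=lambda p: (p.next_fetch_ms, p.url))  # deterministic order
    fetchlist = [p.url for p in due[:limit]]
    if defer:
        for url in fetchlist:
            # Deferred entries are no longer due, so the next generate run
            # skips them until an update records the real fetch result.
            db[url].next_fetch_ms += WEEK_MS
    return fetchlist


def fresh_db():
    urls = ("http://a/", "http://b/", "http://c/", "http://d/")
    return {u: Page(u, 0) for u in urls}


# Without deferral, two generate runs between updates pick the same entries.
db = fresh_db()
same_1 = generate(db, now_ms=1000, limit=2, defer=False)
same_2 = generate(db, now_ms=1000, limit=2, defer=False)

# With deferral, consecutive runs produce disjoint fetchlists.
db = fresh_db()
first = generate(db, now_ms=1000, limit=2)
second = generate(db, now_ms=1000, limit=2)
third = generate(db, now_ms=1000, limit=2)
```

In this sketch the deferral is a write to every selected entry, which corresponds to the cost of modifying the database that the message weighs against being able to fetch several distinct lists in parallel.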

I think this would be a useful feature to resurrect. I'd vote for making it optional, at least at first.

Ideally one could run the CrawlDB update and generate jobs in parallel with the fetch job, so that as soon as one fetch completes, the next can start.

Doug


Nutch-developers mailing list