Hi,

I crawled a website. Around 500 out of 5000 pages generated
errors/exceptions, and I would like to re-crawl only those 500 pages. The
errors are spread across the segments roughly like this:

Segment#1: 0 errors
Segment#2: 120 errors
Segment#3: 10 errors
Segment#4: 370 errors
Segment#5: 0 errors

Q1: If I want to re-crawl those 500 URLs, do I just have to re-crawl all the
URLs in Segment#2, #3, and #4? How do I do this?
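
For reference, here is the kind of workflow I have pieced together so far.
It is only a rough sketch: the crawl/segment paths are examples, and the
readseg options and the way I pick out failed records are based on my
installation, so they may differ in other Nutch versions.

  # Dump one of the segments that reported errors (keep the fetch data,
  # skip content/parse data):
  bin/nutch readseg -dump crawl/segments/20070509123456 seg_dump \
      -nocontent -nogenerate -noparse -noparsedata -noparsetext

  # From seg_dump/dump, collect the URLs whose fetch status is not a
  # success (retry/gone/exception records) into a plain seed list,
  # e.g. failed/urls.txt. I have left this as a manual/grep step because
  # the exact record layout of the dump seems to vary between versions.

  # Re-inject those URLs and run another generate/fetch/updatedb cycle:
  bin/nutch inject crawl/crawldb failed
  bin/nutch generate crawl/crawldb crawl/segments -topN 500
  bin/nutch fetch crawl/segments/<newest segment dir>
  bin/nutch updatedb crawl/crawldb crawl/segments/<newest segment dir>

Is that roughly the right approach, or is there a simpler way to target just
the failed URLs?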

Q2: Say Segment#3 has around 1000 URLs and only 10 of them generated errors.
If I ask Nutch to re-crawl that segment, will it simply re-fetch all of the
URLs? That would be inefficient. Does Nutch have a way to check whether the
content has been modified, for example by using the Last-Modified HTTP
header? Does anybody have suggestions on how to get around this problem?
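
To make the question concrete, this is what I am looking at; the example URL
is made up, and the readdb options/field names are from my installation, so
they may be different (or absent) in other versions.

  # Per-URL record from the crawldb, including its "Modified time" field,
  # which is where I would expect a Last-Modified value to be tracked:
  bin/nutch readdb crawl/crawldb -url http://www.example.com/some-page.html

  # Overall status counts, to check that only ~500 URLs are still
  # unfetched/failed before running another generate/fetch cycle:
  bin/nutch readdb crawl/crawldb -stats

My understanding (please correct me if this is wrong) is that a fresh
generate against the same crawldb should only select URLs that are due,
i.e. the failed ones, and not the ~990 pages in Segment#3 that fetched fine,
because the successful pages keep their fetch interval
(db.default.fetch.interval in my nutch-default.xml). What I have not been
able to confirm is whether Nutch will send an If-Modified-Since header based
on the stored modified time when those pages do come due again.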

Thanks,
Karthik