Hi, I crawled a website. Around 500 of the 5000 pages generated errors/exceptions, and I would like to recrawl only those 500 pages. The errors look something like this:
Segment#1: 0 errors
Segment#2: 120 errors
Segment#3: 10 errors
Segment#4: 370 errors
Segment#5: 0 errors

Q1: If I want to recrawl the 500 failed URLs, do I just re-crawl all the URLs in Segments #2, #3 and #4? How do I do this?

Q2: Say Segment#3 has around 1000 URLs and only 10 of them generated errors. If I ask Nutch to recrawl that segment, will it just recrawl all of the URLs? In that case it would be inefficient. Does Nutch have a way to check whether the content was modified, e.g. using the Last-Modified HTTP header?

Does anybody have suggestions on how to get around this problem?

Thanks,
Karthik

--
View this message in context: http://www.nabble.com/Recrawl-help-tf3717887.html#a10401361
Sent from the Nutch - Dev mailing list archive at Nabble.com.
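P.S. What I have tried so far is dumping the crawldb with `bin/nutch readdb crawl/crawldb -dump dumpdir` and then filtering out the URLs whose last fetch did not succeed, to use as a seed list for a fresh, smaller crawl. This is only a sketch: the dump layout below (a URL line followed by a "Status:" line) is my assumption about the readdb dump format, and the paths and URLs are placeholders.

```shell
# Sketch: pull the URLs whose status is not db_fetched out of a crawldb
# dump, so only those get re-injected and re-crawled.
# The sample record layout here is assumed, not authoritative -- adjust
# the awk patterns to whatever your actual dump files contain.

cat > crawldb-dump.txt <<'EOF'
http://example.com/ok	Version: 5
Status: 2 (db_fetched)

http://example.com/missing	Version: 5
Status: 3 (db_gone)

http://example.com/flaky	Version: 5
Status: 1 (db_unfetched)
EOF

# Remember each URL line; print it if the following Status line
# is anything other than db_fetched:
awk '/^http/ { url = $1 }
     /^Status:/ && $0 !~ /db_fetched/ { print url }' \
    crawldb-dump.txt > failed-urls.txt

cat failed-urls.txt
```

The resulting failed-urls.txt could then be injected as a seed list (`bin/nutch inject crawl/crawldb failed-urls.txt`, followed by the usual generate/fetch/updatedb cycle), but I am not sure this is the intended way, hence my questions above.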