Hi all,

A newbie question about interrupting a crawl, using the completed segments, and recovering the URLs fetched so far in the interrupted segment. I have run Nutch 0.7.2 with the intranet crawl only three or four times, so please pardon any FAQ-ness. I am using intranet mode but crawling the Web with a seed of ~10 sites. I was attempting a depth-5 crawl but wish to interrupt it for the following reasons:
a) it is going to go on for far too long, and
b) from the URLs flying past in the fetcher output, I see more and more noise and very few relevant URLs, so essentially I am now crawling noise.

I have 4 completed levels, and the depth-5 crawl is still running. If I interrupt at this stage, can I:

a) just delete the last segment and be able to use the Nutch web app on the 4 completed segments, or is some more cleanup required? (The depth-4 segment has already been used in updatedb.) When I run "bin/nutch readdb db -stats" I get results identical to the output at the end of the depth-4 crawl, so the answer to a) seems to be 'yes', but I am not clear about the cleanup.

b) recover the URLs in the interrupted segment?

Thanks for all the help,

Nitin Borwankar

_______________________________________________
Nutch-general mailing list
Nutch-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-general
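P.S. The deletion step in question a) can be sketched as a small shell snippet. This is only an illustration, not a tested Nutch procedure: the directory layout (crawl-dir/segments) and the two timestamp names are made up for the example; it assumes segment directory names sort chronologically, as Nutch's date-stamped segment names do, so the lexically last entry is the interrupted one.

```shell
# Sketch: remove the newest (interrupted) segment so only completed
# segments remain for the web app. Paths and names are illustrative.
mkdir -p crawl-dir/segments/20060101120000 crawl-dir/segments/20060102120000

# Segment names are date stamps, so lexical sort = chronological sort;
# the last entry is the segment that was being fetched when interrupted.
last=$(ls crawl-dir/segments | sort | tail -n 1)
echo "removing interrupted segment: $last"
rm -rf "crawl-dir/segments/$last"

ls crawl-dir/segments   # only the completed segments remain
```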