Hi all,

A newbie question about interrupting a crawl, using the completed segments,
and recovering the URLs crawled so far in the interrupted segment.
I have run Nutch 0.7.2 using the intranet crawl all of 3 or 4 times, so
please pardon any FAQ-ness.
I am using intranet mode but crawling the Web with a seed of ~10 sites.
I was attempting a depth-5 crawl but wish to interrupt it for the
following reasons:

a) it is going to take far too long
b) judging from the URLs flying past in the fetcher output, I see more and
more noise and very few relevant URLs

So, essentially, I am crawling noise...

I have 4 completed levels, and the depth-5 crawl is running. If I interrupt
at this stage, can I:

a) just delete the last segment and use the Nutch web app on the 4
completed segments, or is more cleanup required?
   (The depth-4 segment has already been used in updatedb...)
   When I run "bin/nutch readdb db -stats" I get results identical to the
output at the end of the depth-4 crawl, so the answer to a) seems to be
'yes', but I am not clear about the cleanup.

b) recover the URLs in the interrupted segment?
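To make question a) concrete, here is a minimal sketch of the cleanup I have
in mind, assuming segment directories are named by their creation timestamp
(so the newest one sorts last). The directory layout below is fabricated for
illustration; substitute your real crawl directory:

```shell
# Fabricated layout: four completed segments plus the interrupted
# depth-5 one (the timestamp names here are made up for this example).
mkdir -p crawl/segments/20061001120000 crawl/segments/20061002120000 \
         crawl/segments/20061003120000 crawl/segments/20061004120000 \
         crawl/segments/20061005120000

# Segment directories are named by creation time, so the newest sorts last.
last=$(ls -d crawl/segments/* | sort | tail -n 1)
echo "interrupted segment: $last"
rm -r "$last"

ls crawl/segments   # the four completed segments remain
# Then sanity-check the db, as above:
#   bin/nutch readdb crawl/db -stats
```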
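For question b), my current thinking is to dump the interrupted segment to
text and pull the fetched URLs back out as a seed list. This sketch assumes
the 0.7.x segment reader (`bin/nutch segread -dump`) can produce a text dump;
the "URL::" record prefix and the dump fragment below are fabricated, so I
would inspect a real dump first and adjust the pattern:

```shell
# Hypothetical dump fragment -- the real segread output format may differ;
# in practice the dump would come from something like:
#   bin/nutch segread -dump crawl/segments/20061005120000 > segdump.txt
cat > segdump.txt <<'EOF'
Recno:: 0
URL:: http://example.com/a
Recno:: 1
URL:: http://example.com/b
EOF

# Extract the URLs into a seed file for re-injection or a fresh fetchlist.
grep '^URL:: ' segdump.txt | sed 's/^URL:: //' > recovered_urls.txt
cat recovered_urls.txt
```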


Thanks for all the help,

Nitin Borwankar

_______________________________________________
Nutch-general mailing list
Nutch-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-general
