1. create a segment with the initial list
2. fetch the segment
3. update the database
4. create a new segment with the outlinks from [2]
5. fetch the segment created in [4].
I basically want to repeat steps 2 through 5. How would I do this?
Here's what I have in my script:
bin/nutch generate crawl.test/db crawl.test/segments -topN 20 # Create new segment
s1=`ls -d crawl.test/segments/2* | tail -1`
bin/nutch fetch $s1 # Fetch it
bin/nutch updatedb crawl.test/db $s1 # Updatedb with new links
bin/nutch analyze crawl.test/db 5
bin/nutch index $s1
Change the db and segments directories as needed and change topN to suit
your needs. The steps start at a different point than your step 2, but you
probably get the picture. See the Nutch tutorial for more info...
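
To repeat the generate/fetch/updatedb cycle, one option is to wrap those commands in a loop. The following is only a rough sketch, not something taken from the Nutch distribution: it reuses the same commands as the script above, and the names NUTCH, CRAWL, DEPTH, TOPN, crawl_round and crawl_loop are made up for illustration. It assumes the newest segment directory sorts last under segments/2*, as in the script above.

```shell
#!/bin/sh
# Sketch: repeat generate -> fetch -> updatedb a fixed number of times.
# All variable and function names here are hypothetical; adjust to taste.
NUTCH=${NUTCH:-bin/nutch}    # path to the nutch launcher
CRAWL=${CRAWL:-crawl.test}   # crawl directory holding db/ and segments/
DEPTH=${DEPTH:-3}            # number of rounds to run
TOPN=${TOPN:-20}             # pages per generated segment

# One round: create a segment, fetch it, feed its outlinks back to the db.
crawl_round() {
    "$NUTCH" generate "$CRAWL/db" "$CRAWL/segments" -topN "$TOPN" || return 1
    # Pick the most recently created segment (names sort by timestamp).
    seg=$(ls -d "$CRAWL"/segments/2* | tail -1)
    [ -n "$seg" ] || return 1
    "$NUTCH" fetch "$seg" || return 1
    "$NUTCH" updatedb "$CRAWL/db" "$seg" || return 1
}

# Run DEPTH rounds, stopping early if any command fails.
crawl_loop() {
    i=1
    while [ "$i" -le "$DEPTH" ]; do
        crawl_round || return 1
        i=$((i + 1))
    done
}
```

Calling crawl_loop at the end of the script (for example with DEPTH=5 set in the environment) runs the rounds. The analyze and index steps from the script above are left out here and would still have to be run afterwards.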