Hi,

I am using Nutch 0.8-dev. I have a small shell script for the
generate/fetch/update cycle. I used the generate command with -topN 500.
After crawling about 2000 pages I changed -topN to 3 (yes, three pages
only) to see which pages are crawled.

I found that the generate/fetch/update cycles always crawl the same
three pages!
I would expect a different set of pages to be crawled in every cycle
(we have more than 3 pages on the intranet and I am sure I injected
enough seed URLs).

Can anybody tell me what I am doing wrong?

Here are some details of my shell script:
=================================
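#!/bin/bash

d=crawl.test   # crawl directory
depth=3        # number of generate/fetch/update cycles

while [ "$depth" -gt 0 ]; do

  # Generate a new segment with at most 3 URLs
  bin/nutch generate "$d/crawldb" "$d/segments" -topN 3
  # Pick the newest segment (names start with a timestamp)
  s=$(ls -d "$d"/segments/2* | tail -1)

  bin/nutch fetch "$s"

  bin/nutch updatedb "$d/crawldb" "$s"

  depth=$((depth - 1))
done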
=================================
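
To rule out the segment selection as the cause, here is a minimal sketch that exercises only the `ls -d ... | tail -1` step. It assumes segment directories are named with a sortable timestamp, which is what the `2*` glob in my script relies on. The helper name `latest_segment` and the segment names are made up for illustration and are not part of Nutch.

```shell
#!/bin/bash

# Hypothetical helper (not a Nutch command): print the newest segment
# directory under <crawl dir>/segments, assuming timestamp-style names
# that sort lexicographically in chronological order.
latest_segment() {
  ls -d "$1"/segments/2* | tail -1
}

# Throwaway directory with made-up segment names, used only to
# exercise the helper without touching a real crawl.
tmp=$(mktemp -d)
mkdir -p "$tmp/segments/20060301120000" \
         "$tmp/segments/20060301130000" \
         "$tmp/segments/20060302090000"

# Prints the path of the segment with the latest timestamp.
latest_segment "$tmp"

rm -rf "$tmp"
```

If this prints the newest directory, each cycle is fetching the segment that was just generated, so the repetition would have to come from the generate or updatedb step rather than from the script picking up an old segment.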

Regards,
Lukas
