Hi, I am using nutch0.8-dev. I have a small shell script that runs the generate/fetch/update cycle. I ran the generate command with -topN 500. After crawling about 2000 pages I changed -topN to 3 (yes, only three pages) to see which pages get crawled.
I found that the generate/fetch/update cycles always crawl the same three pages! I would expect a different set of pages to be crawled in every cycle (we have more than 3 pages on the intranet, and I am sure I injected enough link food). Can anybody tell me what I am doing wrong?

Here are some details of my shell script:

=================================
#!/bin/bash
d=crawl.test
depth=3
while [ $depth -gt 0 ]; do
  bin/nutch generate $d/crawldb $d/segments -topN 3
  s=`ls -d $d/segments/2* | tail -1`
  bin/nutch fetch $s
  bin/nutch updatedb $d/crawldb $s
  let depth-=1
done
=================================

Regards,
Lukas
