If you download the most recent version of Nutch from SVN, the newer
CrawlTool doesn't fetch pages twice.

As for limiting the number of pages to crawl, you can use the -topN
flag when generating your segments.
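For example, with the step-by-step whole-web tools it looks roughly like
this (the "db" and "segments" directory names are just placeholders, and
the exact usage may differ a bit between Nutch versions):

  bin/nutch generate db segments -topN 1000
  bin/nutch fetch segments/<new segment>
  bin/nutch updatedb db segments/<new segment>

With -topN, each generate/fetch round only fetches the top-scoring 1000
URLs, so the total number of pages fetched is roughly topN times the
number of rounds you run.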

Andy

On 5/26/05, Ian Reardon <[EMAIL PROTECTED]> wrote:
> I have been crawling rather large sites (larger than 10k pages) with
> the crawl command.  It seems like it crawls all the pages twice.  Is
> that normal?  I thought it was just removing the segments, but it looks
> like it crawls all the pages, does some update to the DB, and then
> crawls them again.  If anyone could shed some light on this I would
> appreciate it.
> 
> 2nd question: is there a way to limit a crawl to a number of pages
> rather than depth?  I would like to limit a crawl to say 100 pages,
> 1000 pages, or whatever.  I could brute-force it by writing a script to
> watch the logs and then kill the crawler, but I'd rather not take that
> approach.
> 
> Thanks.
> 
> Ian
>
