Re: Why Crawl failed to fetch so many pages?

Andy Liu Thu, 14 Apr 2005 05:38:59 -0700

By default, Nutch only crawls the first 100 outlinks on a page.  Maybe
that's your problem?
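
To illustrate how such a cap could hide pages, here is a toy sketch in Python. It is not Nutch code: the `crawl` function, the `graph` dictionary, and the `max_outlinks` and `rounds` parameters are all made up for the example, and it assumes a simple round-based crawl where each round fetches only URLs discovered in earlier rounds.

```python
def crawl(graph, seeds, rounds, max_outlinks=100):
    """Toy round-based crawl (hypothetical, not Nutch): returns fetched URLs."""
    known = set(seeds)
    fetched = set()
    for _ in range(rounds):
        # Each round fetches only URLs that are known but not yet fetched.
        batch = sorted(known - fetched)
        if not batch:
            break
        for url in batch:
            fetched.add(url)
            # Only the first max_outlinks links on a page are recorded.
            for link in graph.get(url, [])[:max_outlinks]:
                known.add(link)
    return fetched


# Hypothetical site: one index page linking to 150 articles.
graph = {"index": [f"article{i}" for i in range(150)]}

one_round = crawl(graph, ["index"], rounds=1)
two_rounds = crawl(graph, ["index"], rounds=2)
uncapped = crawl(graph, ["index"], rounds=2, max_outlinks=1000)

print(len(one_round), len(two_rounds), len(uncapped))
```

In this toy setup, one round fetches only the seed page, and with the cap at 100 a second round still cannot reach the 50 articles that were linked after the first 100.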


On 4/14/05, Matthias Jaekle <[EMAIL PROTECTED]> wrote:
> > try
> > +^http://news.buaa.edu.cn/*
> This should not be the reason.
> Your regex matches URLs starting with:
> http://news.buaa.edu.cn
> http://news.buaa.edu.cn/
> http://news.buaa.edu.cn//
> http://news.buaa.edu.cn/// ...
> 
> The only thing I would try is to escape some characters to make it more
> correct. An unescaped dot matches any character. Better:
> +^http:\/\/news\.buaa\.edu\.cn
> 
> Did you run enough rounds to reach the depth you want?
> Each round only fetches links that are already known.
> 
> Matthias
> 
> --
> http://www.eventax.com - eventax GmbH
> http://www.umkreisfinder.de - Die Suchmaschine für Lokales und Events
>
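
The regex point in the quoted message can be checked with a small Python sketch. This is only an approximation: it assumes the leading `+` is the filter's include marker rather than part of the pattern, and that the pattern is applied with search-style matching. The URL filter itself uses Java regular expressions, but these two patterns should behave the same way in Python's `re`. The lookalike host in the last URL is hypothetical.

```python
import re

# Original pattern: dots are unescaped, and the trailing "/*" means
# "zero or more slashes", so it adds no real restriction.
loose = re.compile(r"^http://news.buaa.edu.cn/*")

# Escaped pattern as suggested in the quoted message.
strict = re.compile(r"^http:\/\/news\.buaa\.edu\.cn")

urls = [
    "http://news.buaa.edu.cn",
    "http://news.buaa.edu.cn/",
    "http://news.buaa.edu.cn//index.html",
    "http://newsXbuaa.edu.cn/",  # hypothetical lookalike host
]

loose_hits = [bool(loose.search(u)) for u in urls]
strict_hits = [bool(strict.search(u)) for u in urls]

for u, a, b in zip(urls, loose_hits, strict_hits):
    print(f"{u:40} loose={a} strict={b}")
```

Both patterns accept every URL on the real host, so the original rule is unlikely to be what blocks the pages. Escaping the dots only stops the rule from also accepting lookalike hosts.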
