[Nutch-dev] Re: Why Crawl failed to fetch so many pages?

Andy Liu Thu, 14 Apr 2005 05:48:36 -0700

By default, Nutch only crawls the first 100 outlinks on a page.  Maybe
that's your problem?


On 4/14/05, Matthias Jaekle <[EMAIL PROTECTED]> wrote:
> > try
> > +^http://news.buaa.edu.cn/*
> This should not be the reason.
> Your regex matches URLs starting with:
> http://news.buaa.edu.cn
> http://news.buaa.edu.cn/
> http://news.buaa.edu.cn//
> http://news.buaa.edu.cn/// ...
> 
> The only thing I would try is to escape some characters to make it more
> correct. An unescaped dot matches any character. Better:
> +^http:\/\/news\.buaa\.edu\.cn
> 
> Did you run enough rounds to reach the depth you want?
> Each round only fetches links that are already known.
> 
> Matthias
> 
> --
> http://www.eventax.com - eventax GmbH
> http://www.umkreisfinder.de - Die Suchmaschine für Lokales und Events
>
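
To make the regex point above concrete, here is a minimal sketch in plain
Java comparing the two patterns from this thread. The class name, the
accepts() helper, and the test URLs are made up for illustration. It assumes
the filter does a "find"-style match anchored by ^, which may not be exactly
what your Nutch version does, so treat it as a sketch rather than the real
filter code.

```java
import java.util.regex.Pattern;

public class UrlFilterRegexDemo {
    // Pattern from the thread with unescaped dots: '.' matches any character,
    // and the trailing "/*" means "zero or more slashes", not a wildcard.
    static final String UNESCAPED = "^http://news.buaa.edu.cn/*";

    // Escaped variant suggested in the thread: dots match only a literal '.'.
    static final String ESCAPED = "^http:\\/\\/news\\.buaa\\.edu\\.cn";

    // Hypothetical helper. Assumption: the rule accepts a URL when the regex
    // is found in it; the ^ anchor pins the match to the start of the URL.
    static boolean accepts(String regex, String url) {
        return Pattern.compile(regex).matcher(url).find();
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://news.buaa.edu.cn",
            "http://news.buaa.edu.cn/dispnews.php",
            "http://newsXbuaa.edu.cn/",   // made-up host, not the intended site
            "http://www.buaa.edu.cn/"
        };
        for (String url : urls) {
            System.out.println(url
                + "  unescaped=" + accepts(UNESCAPED, url)
                + "  escaped=" + accepts(ESCAPED, url));
        }
    }
}
```

Both patterns accept every URL on news.buaa.edu.cn, so escaping makes the
rule stricter but would not explain pages going missing. The only difference
is that the unescaped pattern also lets through hosts such as the made-up
newsXbuaa.edu.cn.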


