On Mon, Apr 5, 2010 at 3:32 PM, Anil Kumar <a...@nexusemp.com> wrote:
> Hi,
>
> I'm using the Nutch crawler in my project. I am scraping data from a
> site whose pages contain multiple links leading to other web pages.
> Nutch does not crawl all of these links.
>
> Please help me resolve this problem.
>
> Thanks,
> ANIL KUMAR
Hi,

What is the 'depth' of your crawl? What 'topN' value have you
provided? Note that these parameters are given as the '-depth' and
'-topN' arguments when you invoke 'bin/nutch crawl ...'.

The crawling steps usually go like this:

Depth 1. Read the seed URLs and crawl them.
Depth 2. From the list of URLs found in the pages fetched at depth 1,
select the top 'topN' URLs and crawl them.
Depth 3. From the list of URLs found in the pages fetched at depth 2,
select the top 'topN' URLs and crawl them.
... and so on, until the 'depth' specified in the 'bin/nutch crawl'
command is reached.

So, if the page containing the multiple links is itself fetched at
depth n, you have to ensure that the 'depth' value you specify is at
least n + 1, so that those links are fetched in the next round. If the
'depth' value is already sufficient, it means the crawler is
sacrificing these links because they rank very low, in which case you
might want to increase the 'topN' value.

Hope this helps.

Regards,
Susam Pal
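P.S. For illustration, a crawl that follows links two levels beyond
the seed pages and keeps up to 1000 top-ranked URLs per round could be
invoked like this (the seed directory 'urls', the output directory
'crawl', and the numbers are placeholders to adapt to your setup):

    bin/nutch crawl urls -dir crawl -depth 3 -topN 1000

Here '-depth 3' lets links discovered on pages fetched at depth 2 be
fetched at depth 3, and '-topN 1000' caps how many of the discovered
URLs are selected for fetching in each round.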