On Mon, Apr 5, 2010 at 3:32 PM, Anil Kumar <a...@nexusemp.com> wrote:
> Hi,
>
> I'm using the Nutch crawler in my project. I am scraping data from a
> site whose pages contain multiple links leading to other web pages.
> Nutch does not crawl all of these links.
>
> Please help me resolve this problem.
>
> Thanks,
> ANIL KUMAR
Hi,

What is the 'depth' of your crawl? What 'topN' value have you
provided? Note that these parameters are given as the '-depth' and
'-topN' arguments when you invoke 'bin/nutch crawl ...'.

The crawling steps usually go like this:

Depth 1. Read the seed URLs and crawl them.
Depth 2. From the list of URLs found in the pages fetched at depth 1,
select the top 'topN' URLs and crawl them.
Depth 3. From the list of URLs found in the pages fetched at depth 2,
select the top 'topN' URLs and crawl them.
... and so on, until the 'depth' specified in the 'bin/nutch crawl'
command is reached.

So, if the page containing the multiple links is itself fetched at
depth n, you have to ensure that the 'depth' value you specify is at
least n + 1, so that those links are fetched in the next round. If the
'depth' value is already sufficient, it means the crawler is
sacrificing these links because they rank very low, in which case you
might want to increase the 'topN' value.

Hope this helps.

Regards,
Susam Pal
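P.S. For illustration, a crawl that follows links two levels beyond
the seed pages and keeps up to 1000 top-ranked URLs per round could be
invoked like this (the seed directory 'urls', the output directory
'crawl', and the numbers are placeholders to adapt to your setup):

    bin/nutch crawl urls -dir crawl -depth 3 -topN 1000

Here '-depth 3' lets links discovered on pages fetched at depth 2 be
fetched at depth 3, and '-topN 1000' caps how many of the discovered
URLs are selected for fetching in each round.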