The facts in my earlier mail were slightly wrong. It is not exactly one URL per site, but not all of the URLs listed in the seed-urls file are processed either. For example, out of 53 URLs from the same site, only 3 or 4 were processed. Why is that?
Is this a known bug or expected Nutch behavior? Can this behavior be changed?

On 9/7/07, Smith Norton <[EMAIL PROTECTED]> wrote:
> I have listed around 53 URLs from the same site and 7 other URLs
> from different sites in the seed-urls file 'urls/url'.
>
> They were like:
>
> http://central/s1
> http://central/s1/t
> http://central/s1/topic1
> http://central/s1/topic2
> http://central/s1/topic3
> and so on ....
>
> I was expecting that when I begin the crawl, all these URLs would be
> fetched at depth 1. But I find that in the first depth, only
> http://central/s1 was crawled. The other 7 URLs from distinct
> sites were also crawled.
>
> My first question:
>
> It seems Nutch is selecting one URL per site for the first depth of
> the crawl. Please explain why this is so. How can I change the
> behavior so that it crawls all the URLs I mention in the seed-urls
> file?
>
> My second question:
>
> Not only in the first depth, the other central URLs were never
> fetched in any of the subsequent depths. Why so?
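
For reference, this is roughly how I am running the crawl; the directory names, depth and topN values below are illustrative, not my exact ones:

    bin/nutch crawl urls -dir crawl.test -depth 3 -topN 50

I am also wondering (purely a guess on my part) whether a per-host limit in the generator could explain it. If the generate.max.per.host property is supported in this version of Nutch, overriding it in conf/nutch-site.xml would look like this:

    <property>
      <name>generate.max.per.host</name>
      <!-- -1 means no limit; a positive value caps the number of URLs
           taken from one host in a single fetchlist -->
      <value>-1</value>
    </property>

I am not certain this property is what is limiting my crawl; I mention it only as something I plan to check.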
