The facts in my earlier mail were slightly wrong. It is not exactly one URL per site, but not all of the URLs listed in the seed-urls file are processed either. For example, out of 53 URLs from the same site, only 3 or 4 were processed. Why is that?
Is this a known bug or expected Nutch behavior? Can this behavior be changed?

On 9/7/07, Smith Norton <[EMAIL PROTECTED]> wrote:
> I have listed around 53 URLs from the same site and 7 other URLs
> from different sites in the seed-urls file 'urls/url'.
>
> They were like:
>
> http://central/s1
> http://central/s1/t
> http://central/s1/topic1
> http://central/s1/topic2
> http://central/s1/topic3
> and so on ....
>
> I was expecting that when I begin the crawl, all these URLs would be
> fetched at depth 1. But I find that in the first depth, only
> http://central/s1 was crawled. The other 7 URLs from distinct
> sites were also crawled.
>
> My first question:
>
> It seems Nutch is selecting one URL per site for the first depth of
> the crawl. Please explain why this is so. How can I change the
> behavior so that it crawls all the URLs I mention in the seed-urls
> file?
>
> My second question:
>
> Not only in the first depth, the other central URLs were never
> fetched in any of the subsequent depths. Why so?
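
For reference, this is roughly how I am running the crawl; the directory names, depth and topN values below are illustrative, not my exact ones:

    bin/nutch crawl urls -dir crawl.test -depth 3 -topN 50

I am also wondering (purely a guess on my part) whether a per-host limit in the generator could explain it. If the generate.max.per.host property is supported in this version of Nutch, overriding it in conf/nutch-site.xml would look like this:

    <property>
      <name>generate.max.per.host</name>
      <!-- -1 means no limit; a positive value caps the number of URLs
           taken from one host in a single fetchlist -->
      <value>-1</value>
    </property>

I am not certain this property is what is limiting my crawl; I mention it only as something I plan to check.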
