I found this in the logs:

2007-09-06 17:13:54,707 INFO  crawl.Generator - Generator:
Partitioning selected urls by host, for politeness.

Is this why lots of URLs from the same host are being ignored? If it
partitions, shouldn't it remember the unselected URLs so they can be
crawled later?
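In case it helps anyone answering: as far as I can tell, the partitioning step itself only groups the selected URLs by host so that one host isn't hammered by many fetcher threads; the actual per-host cap would come from the generator's selection settings. A sketch of what I mean, assuming a Nutch 0.8/0.9-style nutch-site.xml (please correct me if the property name differs in your version):

```xml
<!-- nutch-site.xml (override of nutch-default.xml) -->
<property>
  <name>generate.max.per.host</name>
  <!-- -1 means no per-host limit; a positive number caps
       how many URLs from one host go into a single fetchlist -->
  <value>-1</value>
</property>
```

If that property is already -1, another thing worth checking is the -topN argument to the crawl command, since it limits the total number of URLs selected per generate/fetch round.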

Can someone please read the two mails below along with this one and
help me understand what's going on?

On 9/7/07, Smith Norton <[EMAIL PROTECTED]> wrote:
> The facts in the earlier mail are slightly wrong. It's not exactly one
> URL per site, but not all the URLs mentioned in the seed file are
> processed. For example, out of 53 URLs from the same site, only 3 or 4
> were processed. Why is that?
>
> Is this a known bug or expected behavior in Nutch? Can this behavior be changed?
>
> On 9/7/07, Smith Norton <[EMAIL PROTECTED]> wrote:
> > I have listed around 53 URLs from the same site and 7 other URLs
> > from different sites in the seed-urls file 'urls/url'.
> >
> > They were like:
> >
> > http://central/s1
> > http://central/s1/t
> > http://central/s1/topic1
> > http://central/s1/topic2
> > http://central/s1/topic3
> > and so on ....
> >
> > I was expecting that when I began the crawl, all of these URLs
> > would be fetched at depth 1. But I found that at the first depth,
> > only http://central/s1 was crawled. The 7 URLs from the other,
> > distinct sites were also crawled.
> >
> > My first question:
> >
> > It seems it is selecting one URL per site for the first depth of the
> > crawl. Please explain why this is so. How can I change the behavior
> > so that it crawls all the URLs I mention in the seed-urls file?
> >
> > My second question:
> >
> > It's not just the first depth: the other 'central' URLs were never
> > fetched at any of the subsequent depths either. Why is that?
> >
>
