Hello-
I don't know if this is the same problem, but as I reported a couple of
days ago, I am seeing wildly inconsistent generate times. Generating urls
has taken anywhere from a few minutes to many hours. I think this is a bug
in the current version of Nutch, but I have not been able to track it down
yet.
In my case, when generate is running slowly it seems to generate a bunch
of urls, then pause for a second, over and over again. When it is running
quickly it just generates them in one continuous batch. Try changing the
log level to debug and watch the processing of urls. If you see a
scroll-halt-scroll-halt pattern, you are seeing the same behavior I am
seeing. If you just see constant scrolling, then the problem is not
present, and you should get quick results.
thanks
-Jim
----- Original Message -----
From: "Marcin Okraszewski" <[EMAIL PROTECTED]>
To: <[email protected]>
Sent: Thursday, September 06, 2007 12:28 PM
Subject: Re: Re: Effect of no topN argument in generate
According to
http://wiki.apache.org/nutch/nutch-0.8-dev/bin/nutch%20generate
the value is Long.MAX_VALUE.
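To illustrate what that default would mean in practice, here is a minimal sketch, assuming (as the wiki page says) that a missing -topN falls back to Long.MAX_VALUE. The class and the helpers parseTopN and select are hypothetical names made up for this example; this is not Nutch's actual generator code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TopNSketch {
    // Hypothetical helper: returns the number following "-topN",
    // or Long.MAX_VALUE when the flag is absent (assumed default).
    public static long parseTopN(String[] args) {
        for (int i = 0; i < args.length - 1; i++) {
            if ("-topN".equals(args[i])) {
                return Long.parseLong(args[i + 1]);
            }
        }
        return Long.MAX_VALUE;
    }

    // Hypothetical helper: keeps at most topN entries from a candidate
    // list that is assumed to be sorted by score already.
    public static List<String> select(List<String> sortedCandidates, long topN) {
        int limit = (int) Math.min(topN, (long) sortedCandidates.size());
        return new ArrayList<>(sortedCandidates.subList(0, limit));
    }

    public static void main(String[] args) {
        List<String> candidates = Arrays.asList(
                "http://a.example/", "http://b.example/", "http://c.example/");
        // With -topN 2 only the first two candidates are kept.
        System.out.println(select(candidates, parseTopN(new String[] {"-topN", "2"})).size());
        // Without -topN the cap is Long.MAX_VALUE, so every candidate is kept.
        System.out.println(select(candidates, parseTopN(new String[] {})).size());
    }
}
```

Under that assumption, omitting -topN is effectively "no cap": every eligible url would be selected in each generate round.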
Did you run both tests under the same conditions? Or did you perhaps first
run the crawl with -topN 2000 and then without the parameter on the same
crawl db? It may be that there is simply not much left to crawl ...
Regards,
Marcin
I have not added any such thing in my nutch-site.xml and I have
omitted the -topN argument in the bin/generate command.
So my question is what the effect would be in this case. I was
expecting that it would be the same as -topN <infinity>, so it should
generate all possible URLs in the generate phase.
I tried omitting topN value in my crawl script and I find that my
crawl is running much faster. Earlier I had a -topN 2000 argument and
it used to take 4-5 days to finish a crawl of depth 5.
Now, without the topN argument, it finished a crawl of depth 5 in 6
hours. How?
On 9/7/07, Rikard Lindner <[EMAIL PROTECTED]> wrote:
> Now I'm getting a bit uncertain, but I think you can add crawl.topN in
> your nutch-site.xml. I couldn't find it in nutch-default either, but I'm
> quite sure it is set somewhere!
>
> /Rikard
>
> 2007/9/6, Smith Norton <[EMAIL PROTECTED]>:
> >
> > Thanks for the response. What is the property name for this default
> > value of topN in nutch-default.xml?
> >
> > On 9/6/07, Rikard Lindner <[EMAIL PROTECTED]> wrote:
> > > There is a default value in nutch-default.xml
> > >
> > > /Rikard
> > >
> > > 2007/9/6, Smith Norton <[EMAIL PROTECTED]>:
> > > >
> > > > In the bin/generate command, if I omit the 'topN' argument, what is
> > > > the behavior?
> > > >
> > > > Does it generate all possible URLs or does it assume a default topN
> > > > value?
> > > >
> > > > I tried omitting topN value in my crawl script and I find that my
> > > > crawl is running much faster. Earlier I had a -topN 2000 argument and
> > > > it used to take 4-5 days to finish a crawl of depth 5.
> > > >
> > > > Now, without the topN argument, it finished a crawl of depth 5 in 6
> > > > hours. Can anyone explain what's going on?
> > > >
> > >
> >
>