Re: Re: Effect of no topN argument in generate

Smith Norton Fri, 07 Sep 2007 00:33:21 -0700

There is a small difference between the two conditions.

A. First condition, where a complete crawl of depth 5 takes around 5 days:

  1. Only 7 URLs in the seed URL file 'urls/url'.
  2. '-topN 2000' is passed as an argument to generate.

B. Second condition, where a complete crawl of depth 5 takes around 6 hours:

  1. Around 60 URLs in the seed URL file 'urls/url'.
  2. No '-topN' argument for generate. The argument is omitted entirely.

I would also like to mention what the extra 53 URLs in case B are.

In case A, one of the seed URLs is 'http://central/'. The home page
of 'http://central/' has a sidebar with lots of URLs to other
important pages of the 'central' site. As with most sidebars, this set
of sidebar URLs appears on every page of the 'central' site.

I picked up these sidebar URLs (which happen to be 53 in number) and
placed them in the seed URL file in case B.

Can anyone explain why case B should drastically reduce crawl duration
from 5 days to 6 hours?
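
To make the question concrete, here is a minimal Python sketch of how I
understand the -topN cap. It is only a toy model, not Nutch's code: it
assumes generate ranks candidate URLs by score and keeps at most topN of
them per round, and that an omitted -topN behaves like Java's
Long.MAX_VALUE, as the wiki page quoted below says. The function name
select_for_fetch and the example URLs and scores are made up for
illustration.

```python
# Toy model of a per-round fetch-list cap. Hypothetical, not Nutch's code.
LONG_MAX = 2**63 - 1  # value of Java's Long.MAX_VALUE


def select_for_fetch(scored_urls, top_n=LONG_MAX):
    """Return at most top_n URLs, highest score first.

    scored_urls maps URL -> score. Ranking by score is an assumption
    about how generate picks its fetch list.
    """
    ranked = sorted(scored_urls.items(), key=lambda kv: kv[1], reverse=True)
    return [url for url, _score in ranked[:top_n]]


# Made-up candidates for one generate round.
candidates = {
    "http://central/a": 0.9,
    "http://central/b": 0.5,
    "http://central/c": 0.1,
}

capped = select_for_fetch(candidates, top_n=2)  # like '-topN 2'
uncapped = select_for_fetch(candidates)         # like omitting '-topN'
```

In this model the capped round selects only the two best-scored URLs,
while the uncapped round selects every candidate.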

On 9/7/07, Marcin Okraszewski <[EMAIL PROTECTED]> wrote:
> According to http://wiki.apache.org/nutch/nutch-0.8-dev/bin/nutch%20generate
> the value is Long.MAX_VALUE.
>
> Do you run both tests in the same conditions? Or maybe you have first run the 
> crawl with topN 2000 and then without the parameter on the same crawl db? It 
> may happen that there is not so much to crawl anymore ...
>
> Regards,
> Marcin
>
>
> > I have not added any such thing in my nutch-site.xml and I have
> > omitted -topN argument in bin/generate command.
> >
> > So my question is what would be the effect in this case. I was
> > expecting that it would be same as -topN <infinity>. So it should
> > generate all possible URLs in the generate phase.
> >
> > I tried omitting topN value in my crawl script and I find that my
> > crawl is running much faster. Earlier I had a -topN 2000 argument and
> > it used to take 4-5 days to finish a crawl of depth 5.
> >
> > Now, without the topN argument, it finished a crawl of depth 5 in 6
> > hours. How?
> >
> > On 9/7/07, Rikard Lindner <[EMAIL PROTECTED]> wrote:
> > > Now I'm getting a bit uncertain, but I think you can add crawl.topN in your
> > > nutch-site.xml. I couldn't find it in nutch-default either, but I'm quite
> > > sure it is set somewhere!
> > >
> > > /Rikard
> > >
> > > 2007/9/6, Smith Norton <[EMAIL PROTECTED]>:
> > > >
> > > > Thanks for the response. What is the property name for this default
> > > > value of topN in nutch-default.xml?
> > > >
> > > > On 9/6/07, Rikard Lindner <[EMAIL PROTECTED]> wrote:
> > > > > There is a default value in nutch-default.xml
> > > > >
> > > > > /Rikard
> > > > >
> > > > > 2007/9/6, Smith Norton <[EMAIL PROTECTED]>:
> > > > > >
> > > > > > In the bin/generate command, if I omit the 'topN' argument, what is
> > > > > > the behavior?
> > > > > >
> > > > > > Does it generate all possible URLs or does it assume a default topN
> > > > > > value?
> > > > > >
> > > > > > I tried omitting topN value in my crawl script and I find that my
> > > > > > crawl is running much faster. Earlier I had a -topN 2000 argument and
> > > > > > it used to take 4-5 days to finish a crawl of depth 5.
> > > > > >
> > > > > > Now, without the topN argument, it finished a crawl of depth 5 in 6
> > > > > > hours. Can anyone explain what's going on?
> > > > > >
> > > > >
> > > >
> > >
>
>
