RE: Nutch segment merge is very slow

Arkadi.Kosmynin Mon, 05 Apr 2010 15:35:53 -0700

Hi,

> -----Original Message-----
> From: Susam Pal [mailto:[email protected]]
> Sent: Tuesday, 6 April 2010 12:18 AM
> To: [email protected]
> Subject: Re: Nutch segment merge is very slow
> 
> On Mon, Apr 5, 2010 at 5:27 PM, <[email protected]>
> wrote:
> 
> > Hi
> >
> > I'm using the Nutch crawler in my project and have crawled more than 2GB of
> > data using the Nutch runbot script. Up to 2GB, the segment merge finished
> > within 24 hrs, but now it has been running for more than 48 hrs and is
> > still not done. I have set depth to 16 and topN to 2500. I need to run the
> > crawler every day.
> >
> > How can I speed up the segment merge and index process?
> >
> > Regards
> >
> > Ashokkumar.R
> >
> >
> Hi,
> 
> From my experience of running Nutch on a single box to crawl a corporate
> intranet with a depth as high as 16 and a topN value greater than 1000, I
> feel it isn't feasible to have one crawl per day.


That is, if you consider your site a monolithic object and try to recrawl the 
whole site each time. Normally, web sites are not homogeneous. To keep your 
index up to date, you only have to recrawl regularly the parts that change 
fast, which is 1% to 10% of the site, and refresh other parts perhaps once in 
every few months.
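
A minimal sketch of that idea, in Python rather than anything Nutch-specific: 
each page keeps its own refetch interval, which shrinks when the page was found 
changed and grows when it was not. The function names, the rates (0.5 and 1.5) 
and the bounds (1 day to 90 days) are assumptions chosen for illustration, not 
Nutch settings.

```python
DAY = 24 * 60 * 60  # seconds in one day


def next_interval(interval, changed,
                  min_interval=DAY, max_interval=90 * DAY,
                  dec_rate=0.5, inc_rate=1.5):
    """Return the next refetch interval (seconds) for one page.

    Hypothetical policy: shrink the interval when the page changed since
    the last fetch, grow it when it did not, and clamp the result to
    [min_interval, max_interval].
    """
    factor = dec_rate if changed else inc_rate
    return max(min_interval, min(max_interval, interval * factor))


def due_urls(pages, now):
    """Select URLs whose next fetch time has passed.

    `pages` maps url -> (last_fetch_time, interval), both in seconds.
    """
    return sorted(url for url, (last_fetch, interval) in pages.items()
                  if last_fetch + interval <= now)
```

With a policy like this, a page that keeps changing settles at the minimum 
interval, a static page drifts towards the maximum, and a daily run only 
fetches whatever due_urls() returns.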

> 
> One of these options might help you.
> 
> 1. Run Nutch on a Hadoop cluster to distribute the job and speed up
> processing.

Then you have to run your web server on a cluster as well, because it will have 
to serve the entire contents of your site every day, in addition to serving its 
other clients.

> 
> 2. Reduce the crawl depth to about 7, 8, or whatever works for you. This
> means you wouldn't be crawling links discovered in the crawl performed at
> depth 8. This may be a good or a bad thing for you depending on whether you
> want to crawl URLs found so deep in the crawl. These URLs may be obscure and
> less important because they are so many "hops" away from your seed URLs.

Losing quality. Can it be a good thing?

> 
> 3. However, if the URLs found very deep are also important and you want to
> crawl them, you might have to sacrifice low ranking URLs by setting a
> smaller topN value, say, 1000, or whatever works for you.

Losing quality in a different way. How do you calculate ranks? Link-based 
methods do not work nearly as well on intranets as on the global Web. 
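
For scale, here is a rough upper bound on what options 2 and 3 give up. It 
assumes each of the `depth` rounds generates one segment of at most topN URLs, 
so a full crawl fetches at most depth * topN pages. The helper name is made up 
for this example.

```python
def max_fetches(depth, top_n):
    """Upper bound on pages fetched in one crawl.

    Assumes one generate/fetch round per depth level, each capped at
    top_n URLs. Real crawls may fetch fewer if the frontier runs dry.
    """
    return depth * top_n


current = max_fetches(16, 2500)      # settings from the original question
shallower = max_fetches(8, 2500)     # option 2: depth reduced to 8
smaller_top_n = max_fetches(16, 1000)  # option 3: topN reduced to 1000

print(current, shallower, smaller_top_n)
```

Under that assumption the current settings allow up to 40000 fetches per crawl, 
depth 8 allows 20000, and topN 1000 allows 16000. Either option drops more than 
half of the pages.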

 
> Regards,
> Susam Pal

Regards,

Arkadi Kosmynin
CSIRO Astronomy and Space Science
