On Mon, Apr 5, 2010 at 5:27 PM, <ashokkumar.raveendi...@wipro.com> wrote:

> Hi
>
> I'm using the Nutch crawler in my project and have crawled more than 2 GB
> of data using the Nutch runbot script. Up to 2 GB, the segment merge
> completed within 24 hours, but now it takes more than 48 hours and is
> still running. I have set depth to 16 and topN to 2500. My requirement
> is to run the crawler every day.
>
>
>
> How can I speed up the segment merge and index process?
>
>
>
> Regards
>
> Ashokkumar.R
>
>
Hi,

From my experience of running Nutch on a single box to crawl a corporate
intranet with a depth as high as 16 and a topN value greater than 1000, I
feel it isn't feasible to have one crawl per day.

One of these options might help you.

1. Run Nutch on a Hadoop cluster to distribute the job and speed up
processing; a configuration sketch follows this list.

2. Reduce the crawl depth to about 7 or 8, or whatever works for you. This
means you wouldn't be crawling links discovered beyond that depth. This may
be a good or a bad thing for you depending on whether you want to crawl URLs
found so deep in the crawl. Such URLs may be obscure and less important
because they are so many "hops" away from your seed URLs.

3. However, if the URLs found very deep are also important and you want to
crawl them, you might have to sacrifice low-ranking URLs instead by setting
a smaller topN value, say, 1000, or whatever works for you. (See the example
command after this list.)
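
For option 1, Nutch submits its jobs to whatever cluster its Hadoop
configuration points at, so the move is mostly a matter of configuration. As
a rough sketch, assuming a typical setup ("namenode" and "jobtracker" below
are placeholder hostnames for your own machines), the relevant properties,
in hadoop-site.xml or the core-site.xml/mapred-site.xml split on newer
Hadoop releases, would look something like this:

    <configuration>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://namenode:9000</value>
      </property>
      <property>
        <name>mapred.job.tracker</name>
        <value>jobtracker:9001</value>
      </property>
    </configuration>

For options 2 and 3, if you drive the crawl with the one-shot crawl command
instead of the runbot script, depth and topN are plain command-line options.
Here "urls" and "crawl" are placeholders for your seed directory and output
directory:

    bin/nutch crawl urls -dir crawl -depth 7 -topN 1000

If you stay with runbot, the script sets the equivalent depth and topN
values as shell variables near its top (at least in the version on the Nutch
wiki), so the same trade-off applies there.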

Regards,
Susam Pal
