Hi,

Thank you for your suggestion. I have more than 500 internet URLs configured for crawling, and the crawl process runs in the Amazon cloud. I have already reduced depth to 8 and topN to 1000, increased the fetcher threads to 150, and limited the crawl to 50 URLs per host using the generate.max.per.host property. With this configuration, the generate, fetch, parse, and update steps complete in at most 10 hours.

The segment merge, however, takes a long time. As a temporary solution I am skipping the segment merge and indexing the fetched segments directly. With this workaround I can finish the crawl process within 24 hours. I am now looking for a long-term solution to optimize the segment merge process.

Regards,
Ashokkumar.R

-----Original Message-----
From: Susam Pal [mailto:susam....@gmail.com]
Sent: Monday, April 05, 2010 7:48 PM
To: firstname.lastname@example.org
Subject: Re: Nutch segment merge is very slow

On Mon, Apr 5, 2010 at 5:27 PM, <ashokkumar.raveendi...@wipro.com> wrote:
> Hi
>
> I'm using the Nutch crawler in my project and have crawled more than 2GB
> of data using the Nutch runbot script. Up to 2GB, the segment merge
> finished within 24 hrs, but now it takes more than 48 hrs and is still
> running. I have set depth to 16 and topN to 2500. I want to run the
> crawler every day as per my requirement.
>
> How can I speed up the segment merge and index process?
>
> Regards
> Ashokkumar.R

Hi,

From my experience of running Nutch on a single box to crawl a corporate intranet with a depth as high as 16 and a topN value greater than 1000, I feel it isn't feasible to have one crawl per day. One of these options might help you.

1. Run Nutch on a Hadoop cluster to distribute the job and speed up processing.

2. Reduce the crawl depth to about 7, 8, or whatever works for you. This means you wouldn't be crawling links discovered in the crawl performed at depth 8. This may be a good or a bad thing for you depending on whether you want to crawl URLs found so deep in the crawl. These URLs may be obscure and less important because they are so many "hops" away from your seed URLs.

3. However, if the URLs found very deep are also important and you want to crawl them, you might have to sacrifice low-ranking URLs by setting a smaller topN value, say, 1000, or whatever works for you.

Regards,
Susam Pal
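
As a rough illustration of why the reduced settings discussed in this thread shorten the crawl, the sketch below compares the upper bound on URLs selected for fetching under the original settings (depth 16, topN 2500) and the reduced ones (depth 8, topN 1000). It assumes that depth is the number of generate/fetch rounds and that topN caps the URLs generated in each round, so depth × topN is only a ceiling; the function name is hypothetical, and actual fetch counts, segment sizes, and merge times may not scale in the same proportion.

```python
def max_urls_per_crawl(depth: int, top_n: int) -> int:
    """Upper bound on URLs selected for fetching in one full crawl.

    Assumes `depth` generate/fetch rounds with at most `top_n` URLs
    generated per round. Real crawls may fetch fewer.
    """
    return depth * top_n


# Settings taken from the thread.
original = max_urls_per_crawl(depth=16, top_n=2500)
reduced = max_urls_per_crawl(depth=8, top_n=1000)

print(f"original upper bound: {original}")
print(f"reduced upper bound:  {reduced}")
print(f"reduction factor:     {original / reduced:.1f}x")
```

Under these assumptions the reduced configuration selects at most one fifth as many URLs per crawl, which is consistent with the fetch cycle now fitting in about 10 hours, but it says nothing about the cost of merging segments accumulated from earlier crawls.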