Hi,

----- Original Message ----
> From: Siddhartha Reddy <[EMAIL PROTECTED]>
> To: [email protected]
> Sent: Wednesday, April 23, 2008 12:49:07 AM
> Subject: Re: Fetching inefficiency
>
> I have observed a significant improvement after setting
> generate.max.per.host to 1000. Earlier, one of my fetch jobs for a few
> thousand pages went on for days because of a couple of sites that were
> too slow. For the same crawl, I am now using a generate.max.per.host of
> 1000, and each fetch job finishes in about 3 hours for around 30,000
> pages, while the other jobs -- generate, parse, updatedb -- take up
> another hour.
>
> You are right about the additional overhead of having more generate
> jobs. I am now planning to parallelize the generate jobs with fetch (by
> using a numFetchers that is less than the number of map tasks
> available) and am hoping that it will offset the time for the
> additional generates.

Great. Could you please let us know whether the recipe on
http://wiki.apache.org/nutch/FetchCycleOverlap helped, and roughly how much?

> The cost of setting up the MapReduce jobs might in fact become a
> significant one if I reduce generate.max.per.host even further (or it
> might already be quite a lot and I am just not noticing). I will be
> doing some experimentation to find the optimum point, but the results
> might be too specific to my current crawl.
>
> On my first attempt I could not apply the NUTCH-570 patch, so I left it
> for later. Anyway, as long as I am using a small generate.max.per.host
> I doubt that it would help much.

I can send you my Generator.java if you want; it has NUTCH-570 and a few
other small changes.

> I am using NUTCH-629, but I am not sure how to measure whether it is
> offering any improvement.

I would measure it the same way you described in your first paragraph: by
looking at the total time the fetch job takes to complete, or perhaps simply
by eyeballing the pages/sec rates. The idea behind NUTCH-629 is that if
requests to a host keep timing out, there is no point in wasting time
requesting more pages from it. This really only pays off when hosts with
lots of URLs in the fetchlist time out. There is no point in dropping hosts
with only a few URLs, because even with timeouts those are processed
quickly. It is the hosts with lots of pages that keep timing out that are
the problem, so that is where you should see the greatest benefit.
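Also, in case it helps, here is roughly what I meant by overlapping the
cycles. This is only a sketch using the stock 0.9-era commands; the crawl/
paths, the -topN value, and the assumption of 4 map slots (3 of them given
to the fetcher via -numFetchers) are made up for illustration, and on a DFS
setup you would list the segment with bin/hadoop dfs -ls rather than plain
ls:

  # generate segment N and start fetching it in the background
  bin/nutch generate crawl/crawldb crawl/segments -topN 30000 -numFetchers 3
  SEG_N=`ls -d crawl/segments/2* | tail -1`
  bin/nutch fetch $SEG_N &

  # while the fetch occupies 3 of the 4 map slots, generate segment N+1
  # on the remaining capacity; the crawldb has not been updated yet, so
  # the new segment can repeat some of the URLs from $SEG_N
  bin/nutch generate crawl/crawldb crawl/segments -topN 30000 -numFetchers 3

  # when the fetch finishes, parse (only if you fetched with -noParsing)
  # and update the crawldb, then repeat the loop with segment N+1
  wait
  bin/nutch parse $SEG_N
  bin/nutch updatedb crawl/crawldb $SEG_N

The caveat is that the second generate runs before updatedb, so segment N+1
can repeat URLs from segment N and some pages get fetched twice; overlapping
the fetch jobs themselves would additionally raise the politeness concern
from my original message quoted below.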
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


> On Wed, Apr 23, 2008 at 9:29 AM,  wrote:
> >
> > Siddhartha,
> >
> > I think decreasing generate.max.per.host will limit the 'wait time' for
> > each fetch run, but I have a feeling that the overall time will be
> > roughly the same. As a matter of fact, it may even be higher, because
> > you'll have to run generate more times, and if your fetch jobs are too
> > short, you will be spending more time waiting on MapReduce jobs (JVM
> > instantiation, job initialization, ...).
> >
> > Have you tried NUTCH-570? I know it doesn't break anything, but I have
> > not been able to see its positive effects -- likely because my fetch
> > cycles are dominated by those slow servers with lots of pages and not
> > by the wait time between subsequent requests to the same server. But
> > I'd love to hear if others found NUTCH-570 helpful!
> >
> > Otis
> > --
> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> >
> > ----- Original Message ----
> > > From: Siddhartha Reddy
> > > To: [email protected]
> > > Sent: Monday, April 21, 2008 4:59:03 PM
> > > Subject: Re: Fetching inefficiency
> > >
> > > I do face a similar problem. I occasionally have fetch jobs that are
> > > fetching from fewer than 100 hosts, and the effect is magnified in
> > > that case.
> > >
> > > I have found one workaround for this, but I am not sure it is the
> > > best possible solution: I set generate.max.per.host to a fairly small
> > > value (like 1000), which limits the maximum amount of time any task
> > > can be held up by a particular host. This does increase the number of
> > > cycles needed to finish a crawl, but it does solve the problem you
> > > mention. It might even make sense to use a lower value still -- I am
> > > still experimenting to find a good value myself.
> > >
> > > In addition, I think NUTCH-629 and NUTCH-570 could help reduce the
> > > effects of the problem caused by slow servers.
> > >
> > > Best,
> > > Siddhartha Reddy
> > >
> > > On Tue, Apr 22, 2008 at 1:46 AM,  wrote:
> > > >
> > > > Hello,
> > > >
> > > > I am wondering how others deal with the following, which I see as
> > > > fetching inefficiency:
> > > >
> > > > When fetching, the fetchlist is broken up into multiple parts and
> > > > fetchers on the cluster nodes start fetching. Some fetchers end up
> > > > fetching from fast servers, and some from very, very slow servers.
> > > > Those fetching from slow servers take a long time to complete and
> > > > prolong the whole fetching process. For instance, I've seen tasks
> > > > from the same fetch job finish in only 1-2 hours, and others in 10
> > > > hours. Those taking 10 hours were stuck fetching pages from a
> > > > single slow site or a handful of them. If you have two nodes doing
> > > > the fetching and one is stuck with a slow server, the other one is
> > > > idling and wasting time. The node stuck with the slow server is
> > > > also underutilized, as it is slowly fetching from only one server
> > > > instead of many.
> > > >
> > > > I imagine anyone using Nutch is seeing the same. If not, what's the
> > > > trick?
> > > >
> > > > I have not tried overlapping fetch jobs yet, but I have a feeling
> > > > that won't help a ton, plus it could lead to two fetchers fetching
> > > > from the same server and being impolite -- am I wrong?
> > > >
> > > > Thanks,
> > > > Otis
> > > > --
> > > > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> > > >
> > >
> > > --
> > > http://sids.in
> > > "If you are not having fun, you are not doing it right."
> >
>
> --
> http://sids.in
> "If you are not having fun, you are not doing it right."
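P.S. For anyone digging this thread out of the archives: the
generate.max.per.host override discussed above goes into conf/nutch-site.xml,
roughly like this (1000 is just the value used as an example in this thread;
the default of -1 means no per-host limit):

  <property>
    <name>generate.max.per.host</name>
    <value>1000</value>
    <description>Maximum number of URLs per host in a single fetchlist;
    -1 means no limit.</description>
  </property>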
