Hi Otis,
> Great. Could you please let us know if using the recipe on
> http://wiki.apache.org/nutch/FetchCycleOverlap helped and how much,
> roughly?

I am trying a slightly different strategy: I am going to run the generate
jobs in parallel with the fetch job. As for running updatedb in parallel
with the fetch job, I am not too sure -- updatedb can take a list of
segments, so wouldn't it be better to update all of them together? In any
case, I will report on any improvements I get.
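To make the overlap concrete, this is roughly the cycle I have in mind
(only a sketch -- the paths, -topN, -numFetchers and -threads values are
placeholders, not what I actually run):

  # generate a segment; keeping -numFetchers below the number of map
  # tasks available leaves slots free for the next generate to run
  # alongside the fetch
  bin/nutch generate crawl/crawldb crawl/segments -topN 30000 -numFetchers 3
  s1=`ls -d crawl/segments/2* | tail -1`   # newest segment directory

  # fetch the current segment in the background and generate the next
  # segment while that fetch is running
  bin/nutch fetch $s1 -threads 40 &
  bin/nutch generate crawl/crawldb crawl/segments -topN 30000 -numFetchers 3
  s2=`ls -d crawl/segments/2* | tail -1`
  wait

  # fetch the second segment, then update the crawldb once for both,
  # since updatedb accepts a list of segments
  bin/nutch fetch $s2 -threads 40
  bin/nutch updatedb crawl/crawldb $s1 $s2

This assumes the second generate does not simply re-select the URLs that
are still sitting unfetched in the first segment; whether that holds is
one of the things I need to check against the FetchCycleOverlap recipe.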
> > On my first attempt, I could not apply the NUTCH-570 patch, so I left
> > it for later. Anyways, as long as I am using a small
> > generate.max.per.host I doubt that it would help much.
>
> I can send you my Generator.java, if you want, it has NUTCH-570 and a
> few other little changes.

Thanks, that would really help me; can you please send it to me?

> > I am using NUTCH-629 but I am not sure how to measure if it is
> > offering any improvements.
>
> I think the same way you described in the first paragraph - by looking
> at the total time it took for the fetch job to complete, or perhaps
> simply by looking at pg/sec rates and eyeballing. The idea there is
> that if requests to a host keep timing out, there is no point in
> wasting time requesting more pages from it. This really only pays off
> if hosts with lots of URLs in the fetchlists time out. There is no
> point in dropping hosts with only a few URLs, as even with timeouts
> those will be processed quickly. It is those with lots of pages that
> keep timing out that are the problem. So you should see the greatest
> benefit in those cases.

The problem is that the URLs from the hosts on the slow servers have all
already been fetched or have timed out, and I do not wish to hit the same
URLs again. Perhaps I can just dump the crawldb and take a look at the
metadata.
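If I do go down that route, I am thinking of something along these lines
(the paths and the example URL are placeholders):

  # overall counts per CrawlDatum status for the whole db
  bin/nutch readdb crawl/crawldb -stats

  # dump the crawldb as text so per-URL status, fetch time and metadata
  # can be grepped through
  bin/nutch readdb crawl/crawldb -dump crawldb-dump

  # or inspect a single URL from one of the slow hosts
  bin/nutch readdb crawl/crawldb -url http://slow.example.com/some/page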
Thanks,
Siddhartha

On Wed, Apr 23, 2008 at 9:00 PM, <[EMAIL PROTECTED]> wrote:

> Hi,
>
> ----- Original Message ----
> > From: Siddhartha Reddy <[EMAIL PROTECTED]>
> > To: [email protected]
> > Sent: Wednesday, April 23, 2008 12:49:07 AM
> > Subject: Re: Fetching inefficiency
> >
> > I have observed a significant improvement after setting
> > generate.max.per.host to 1000. Earlier, one of my fetch jobs for a
> > few thousand pages went on for days because of a couple of sites
> > that were too slow. For the same crawl, I am now using a
> > generate.max.per.host of 1000, and each fetch job finishes in about
> > 3 hours for around 30,000 pages while the other jobs -- generate,
> > parse, updatedb -- take up another hour.
> >
> > You are right about the additional overhead of having more generate
> > jobs. I am now planning to parallelize the generate jobs with fetch
> > (by using a numFetchers that is less than the number of map tasks
> > available) and am hoping that it would offset the time for the
> > additional generates.
>
> Great. Could you please let us know if using the recipe on
> http://wiki.apache.org/nutch/FetchCycleOverlap helped and how much,
> roughly?
>
> > The cost of setting up the MapReduce jobs might in fact become a
> > significant one if I reduce generate.max.per.host even further (or
> > it might even be quite a lot and I am just not noticing). I will be
> > doing some experimentation to find the optimum point; but the
> > results might be too specific to my current crawl.
> >
> > On my first attempt, I could not apply the NUTCH-570 patch, so I
> > left it for later. Anyways, as long as I am using a small
> > generate.max.per.host I doubt that it would help much.
>
> I can send you my Generator.java, if you want, it has NUTCH-570 and a
> few other little changes.
>
> > I am using NUTCH-629 but I am not sure how to measure if it is
> > offering any improvements.
>
> I think the same way you described in the first paragraph - by looking
> at the total time it took for the fetch job to complete, or perhaps
> simply by looking at pg/sec rates and eyeballing. The idea there is
> that if requests to a host keep timing out, there is no point in
> wasting time requesting more pages from it. This really only pays off
> if hosts with lots of URLs in the fetchlists time out. There is no
> point in dropping hosts with only a few URLs, as even with timeouts
> those will be processed quickly. It is those with lots of pages that
> keep timing out that are the problem. So you should see the greatest
> benefit in those cases.
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
> > On Wed, Apr 23, 2008 at 9:29 AM, wrote:
> >
> > > Siddhartha,
> > >
> > > I think decreasing generate.max.per.host will limit the 'wait time'
> > > for each fetch run, but I have a feeling that the overall time will
> > > be roughly the same. As a matter of fact, it may even be higher,
> > > because you'll have to run generate more times, and if your fetch
> > > jobs are too short, you will be spending more time waiting on
> > > MapReduce jobs (JVM instantiation, job initialization....)
> > >
> > > Have you tried NUTCH-570? I know it doesn't break anything, but I
> > > have not been able to see its positive effects - likely because my
> > > fetch cycles are dominated by those slow servers with lots of pages
> > > and not by wait time between subsequent requests to the same
> > > server. But I'd love to hear if others found NUTCH-570 helpful!
> > >
> > > Otis
> > > --
> > > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> > >
> > > ----- Original Message ----
> > > > From: Siddhartha Reddy
> > > > To: [email protected]
> > > > Sent: Monday, April 21, 2008 4:59:03 PM
> > > > Subject: Re: Fetching inefficiency
> > > >
> > > > I do face a similar problem. I occasionally have some fetch jobs
> > > > that are fetching from fewer than 100 hosts, and the effect is
> > > > magnified in those cases.
> > > >
> > > > I have found one workaround for this but I am not sure if it is
> > > > the best possible solution: I set the value of
> > > > generate.max.per.host to a pretty small value (like 1000), and
> > > > this reduces the maximum amount of time any task is going to be
> > > > held up due to a particular host. This does increase the number
> > > > of cycles that are needed to finish a crawl but does solve the
> > > > mentioned problem. It might even make sense to have an even lower
> > > > value -- I am still experimenting to find a good value myself.
> > > >
> > > > In addition, I think NUTCH-629 and NUTCH-570 could help reduce
> > > > the effects of the problem caused by slow servers.
> > > >
> > > > Best,
> > > > Siddhartha Reddy
> > > >
> > > > On Tue, Apr 22, 2008 at 1:46 AM, wrote:
> > > >
> > > > > Hello,
> > > > >
> > > > > I am wondering how others deal with the following, which I see
> > > > > as fetching inefficiency:
> > > > >
> > > > > When fetching, the fetchlist is broken up into multiple parts
> > > > > and fetchers on cluster nodes start fetching. Some fetchers end
> > > > > up fetching from fast servers, and some from very, very slow
> > > > > servers. Those fetching from slow servers take a long time to
> > > > > complete and prolong the whole fetching process. For instance,
> > > > > I've seen tasks from the same fetch job finish in only 1-2
> > > > > hours, and others in 10 hours. Those taking 10 hours were stuck
> > > > > fetching pages from a single slow site or a handful of them. If
> > > > > you have two nodes doing the fetching and one is stuck with a
> > > > > slow server, the other one is idling and wasting time. The node
> > > > > stuck with the slow server is also underutilized, as it's
> > > > > slowly fetching from only one server instead of many.
> > > > >
> > > > > I imagine anyone using Nutch is seeing the same. If not, what's
> > > > > the trick?
> > > > >
> > > > > I have not tried overlapping fetching jobs yet, but I have a
> > > > > feeling that won't help a ton, plus it could lead to two
> > > > > fetchers fetching from the same server and being impolite - am
> > > > > I wrong?
> > > > >
> > > > > Thanks,
> > > > > Otis
> > > > > --
> > > > > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> > > >
> > > > --
> > > > http://sids.in
> > > > "If you are not having fun, you are not doing it right."
> >
> > --
> > http://sids.in
> > "If you are not having fun, you are not doing it right."

--
http://sids.in
"If you are not having fun, you are not doing it right."
