Hi Otis,

> Great.  Could you please let us know if using the recipe on
> http://wiki.apache.org/nutch/FetchCycleOverlap helped and how much,
> roughly?
>

I am trying a slightly different strategy: I am going to run the generate
jobs in parallel with the fetch job. As for running updatedb in parallel
with the fetch job, I am not so sure -- since updatedb can take a list of
segments, wouldn't it be better to update all of them together? In any case,
I will report on any improvements I get.
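
Roughly, the cycle I have in mind looks like the sketch below (just a rough
sh sketch to illustrate: the crawl/crawldb and crawl/segments paths, the
-topN value and the number of rounds are placeholders, and it leaves out the
part of the FetchCycleOverlap recipe that keeps successive fetchlists from
overlapping):

  #!/bin/sh
  # Rough sketch only: CRAWLDB/SEGMENTS paths, -topN and the number of
  # rounds are placeholders, not values from this thread.
  CRAWLDB=crawl/crawldb
  SEGMENTS=crawl/segments

  # Generate the first fetchlist up front.
  bin/nutch generate $CRAWLDB $SEGMENTS -topN 30000
  SEG=`ls -d $SEGMENTS/* | tail -1`

  for round in 1 2 3; do
    # Start the next generate in the background while the current segment
    # is being fetched, so the short generate job overlaps the long fetch.
    bin/nutch generate $CRAWLDB $SEGMENTS -topN 30000 &
    GEN_PID=$!

    bin/nutch fetch $SEG
    bin/nutch parse $SEG        # not needed if fetcher.parse is true
    FETCHED="$FETCHED $SEG"

    wait $GEN_PID
    SEG=`ls -d $SEGMENTS/* | tail -1`
  done

  # updatedb takes a list of segments, so all fetched segments can be
  # applied to the crawldb in a single pass at the end.
  bin/nutch updatedb $CRAWLDB $FETCHED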

> > On my first attempt, I could not apply the NUTCH-570 patch, so I left it
> > for later. Anyways, as long as I am using a small generate.max.per.host I
> > doubt that it would help much.
>
> I can send you my Generator.java, if you want, it has NUTCH-570 and a few
> other little changes.
>

Thanks, that would really help me; can you please send it to me?

> > I am using NUTCH-629 but I am not sure how to measure if it is offering
> > any improvements.
>
> I think the same way you described in the first paragraph - by looking at
> the total time it took for the fetch job to complete, or perhaps simply by
> looking at pg/sec rates and eyeballing.  The idea there is that if requests
> to a host keep timing out, there is no point in wasting time requesting
> more pages from it.  This really only pays off if hosts with lots of URLs
> in the fetchlists time out.  There is no point in dropping hosts with only
> a few URLs, as even with timeouts those will be processed quickly.  It is
> those with lots of pages that keep timing out that are the problem.  So you
> should see the greatest benefit in those cases.
>

The problem is that the URLs from the slow hosts have all already been
fetched or timed out, and I do not wish to hit the same URLs again. Perhaps
I can just dump the crawldb and take a look at the metadata.
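
Something along these lines, for instance (the paths are placeholders, and
the exact field and status names in the dump may differ between Nutch
versions):

  # Dump the crawldb to plain text and look at each URL's status, retry
  # count and metadata.
  bin/nutch readdb crawl/crawldb -dump crawldb-dump
  less crawldb-dump/part-00000

  # Per-status counts, without a full dump:
  bin/nutch readdb crawl/crawldb -stats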

Thanks,
Siddhartha

On Wed, Apr 23, 2008 at 9:00 PM, <[EMAIL PROTECTED]> wrote:

> Hi,
>
>  ----- Original Message ----
>
> > From: Siddhartha Reddy <[EMAIL PROTECTED]>
> > To: [email protected]
> > Sent: Wednesday, April 23, 2008 12:49:07 AM
> > Subject: Re: Fetching inefficiency
> >
> > I have observed a significant improvement after setting
> > generate.max.per.host to 1000. Earlier, one of my fetch jobs for a few
> > thousand pages went on for days because of a couple of sites that were
> > too slow. For the same crawl, I am now using a generate.max.per.host of
> > 1000 and each fetch job finishes in about 3 hours for around 30,000 pages
> > while the other jobs -- generate, parse, updatedb -- take up another hour.
> >
> > You are right about the additional overhead of having more generate jobs.
> > I am now planning to parallelize the generate jobs with fetch (by using a
> > numFetchers that is less than the number of map tasks available) and am
> > hoping that it would offset the time for the additional generates.
>
> Great.  Could you please let us know if using the recipe on
> http://wiki.apache.org/nutch/FetchCycleOverlap helped and how much,
> roughly?
>
> > The cost of setting up the MapReduce jobs might in fact become a
> > significant one if I reduce generate.max.per.host even further (or it
> > might even be quite a lot and I am just not noticing). I will be doing
> > some experimentation to find the optimum point, but the results might be
> > too specific to my current crawl.
> >
> > On my first attempt, I could not apply the NUTCH-570 patch, so I left it
> > for later. Anyways, as long as I am using a small generate.max.per.host I
> > doubt that it would help much.
>
> I can send you my Generator.java, if you want, it has NUTCH-570 and a few
> other little changes.
>
> > I am using NUTCH-629 but I am not sure how to measure if it is offering
> > any improvements.
>
> I think the same way you described in the first paragraph - by looking at
> the total time it took for the fetch job to complete, or perhaps simply by
> looking at pg/sec rates and eyeballing.  The idea there is that if requests
> to a host keep timing out, there is no point in wasting time requesting
> more pages from it.  This really only pays off if hosts with lots of URLs
> in the fetchlists time out.  There is no point in dropping hosts with only
> a few URLs, as even with timeouts those will be processed quickly.  It is
> those with lots of pages that keep timing out that are the problem.  So you
> should see the greatest benefit in those cases.
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>
> > On Wed, Apr 23, 2008 at 9:29 AM, wrote:
> >
> > > Siddhartha,
> > >
> > > I think decreasing generate.max.per.host will limit the 'wait time' for
> > > each fetch run, but I have a feeling that the overall time will be
> > > roughly the same.  As a matter of fact, it may be even higher, because
> > > you'll have to run generate more times, and if your fetch jobs are too
> > > short, you will be spending more time waiting on MapReduce jobs (JVM
> > > instantiation, job initialization....)
> > >
> > > Have you tried NUTCH-570?  I know it doesn't break anything, but I have
> > > not been able to see its positive effects - likely because my fetch
> > > cycles are dominated by those slow servers with lots of pages and not by
> > > wait time between subsequent requests to the same server.  But I'd love
> > > to hear if others found NUTCH-570 helpful!
> > >
> > > Otis
> > > --
> > > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> > >
> > >
> > > ----- Original Message ----
> > > > From: Siddhartha Reddy
> > > > To: [email protected]
> > > > Sent: Monday, April 21, 2008 4:59:03 PM
> > > > Subject: Re: Fetching inefficiency
> > > >
> > > > I do face a similar problem. I occasionally have some fetch jobs that
> > > > are fetching from less than 100 hosts; the effect is magnified in this
> > > > case.
> > > >
> > > > I have found one workaround for this, but I am not sure if this is the
> > > > best possible solution: I set the value of generate.max.per.host to a
> > > > pretty small value (like 1000) and this reduces the maximum amount of
> > > > time any task is going to be held up due to a particular host. This
> > > > does increase the number of cycles that are needed to finish a crawl
> > > > but does solve the mentioned problem. It might even make sense to have
> > > > an even lower value -- I am still experimenting to find a good value
> > > > myself.
> > > >
> > > > In addition, I think NUTCH-629 and NUTCH-570 could help reduce the
> > > > effects of the problem caused by slow servers.
> > > >
> > > > Best,
> > > > Siddhartha Reddy
> > > >
> > > > On Tue, Apr 22, 2008 at 1:46 AM, wrote:
> > > >
> > > > > Hello,
> > > > >
> > > > > I am wondering how others deal with the following, which I see as
> > > > > fetching inefficiency:
> > > > >
> > > > > When fetching, the fetchlist is broken up into multiple parts and
> > > > > fetchers on cluster nodes start fetching.  Some fetchers end up
> > > > > fetching from fast servers, and some from very very slow servers.
> > > > > Those fetching from slow servers take a long time to complete and
> > > > > prolong the whole fetching process.  For instance, I've seen tasks
> > > > > from the same fetch job finish in only 1-2 hours, and others in 10
> > > > > hours.  Those taking 10 hours were stuck fetching pages from a
> > > > > single or handful of slow sites.  If you have two nodes doing the
> > > > > fetching and one is stuck with a slow server, the other one is
> > > > > idling and wasting time.  The node stuck with the slow server is
> > > > > also underutilized, as it's slowly fetching from only 1 server
> > > > > instead of many.
> > > > >
> > > > > I imagine anyone using Nutch is seeing the same.  If not, what's the
> > > > > trick?
> > > > >
> > > > > I have not tried overlapping fetching jobs yet, but I have a
> > > > > feeling that won't help a ton, plus it could lead to two fetchers
> > > > > fetching from the same server and being impolite - am I wrong?
> > > > >
> > > > > Thanks,
> > > > > Otis
> > > > > --
> > > > > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > http://sids.in
> > > > "If you are not having fun, you are not doing it right."
> > >
> > >
> >
> >
> > --
> > http://sids.in
> > "If you are not having fun, you are not doing it right."
>
>


-- 
http://sids.in
"If you are not having fun, you are not doing it right."
