Although only one of the machines will be used for the fetch task (because
all your urls are from a single host), the other tasks have no such
constraint and can run on multiple machines. So running in distributed
mode might still benefit you.

To 'turn off' the 3 slaves, you can simply remove them from the conf/slaves
file. You might also want to change the other dfs parameters
correspondingly. I would suggest that you turn off dfs entirely in this case
by setting 'fs.default.name' to 'file:///' and 'mapred.job.tracker' to
'local'.
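
For example, a minimal sketch of what that could look like in your
conf/hadoop-site.xml overrides (adjust the file name to wherever you keep
your site-specific settings; the values below are just the standalone
defaults):

  <configuration>
    <!-- use the local filesystem instead of HDFS -->
    <property>
      <name>fs.default.name</name>
      <value>file:///</value>
    </property>
    <!-- run map/reduce jobs in-process instead of on a jobtracker -->
    <property>
      <name>mapred.job.tracker</name>
      <value>local</value>
    </property>
  </configuration>

With those two settings, everything runs on the single master machine and
you avoid the overhead of the dfs and jobtracker daemons altogether.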

Best,
Siddhartha

On Jan 27, 2008 5:32 AM, John Mendenhall <[EMAIL PROTECTED]> wrote:

> Andrzej Bialecki,
>
> > >All hosts are the same.  Everyone of them.
> > >
> > >If there is no way to split them up, this seems to
> > >imply the distributed nature of nutch is lost on
> > >attempting to build an index for a single large
> > >site.  Please correct me if I am wrong with this
> > >presumption.
> >
> > It doesn't matter whether you use a distributed crawl or not - you still
> > are expected to crawl politely, meaning that you should not exceed a
> > certain rate of requests / sec to any given host. Since all your urls
> > come from the same host, no matter how many machines you throw at
> > it, you will still be crawling at a rate of 1 page / 5 seconds (or
> > whatever you set in nutch-site.xml). So, a single machine can manage
> > this just fine.
>
> Currently, I have 4 machines running nutch, one master/slave,
> and 3 pure slaves.  What is the best procedure for turning off
> the 3 slaves?
>
> Should I go back to a "local" setup only, without the overhead
> of hadoop dfs?
>
> What is the best recommendation?
>
> Thanks!
>
> JohnM
>
> --
> john mendenhall
> [EMAIL PROTECTED]
> surf utopia
> internet services
>



-- 
http://sids.in
"If you are not having fun, you are not doing it right."