Although only one of the machines will be used for the fetch task (because all your urls are from a single host), the other tasks have no such requirement and can run on multiple machines. So running in distributed mode might still benefit you.
To 'turn off' the 3 slaves, you can simply remove them from the conf/slaves file. You might also want to change the other dfs parameters correspondingly. I would suggest that you turn off dfs entirely in this case by setting 'fs.default.name' to 'file:///' and 'mapred.job.tracker' to 'local'.

Best,
Siddhartha

On Jan 27, 2008 5:32 AM, John Mendenhall <[EMAIL PROTECTED]> wrote:
> Andrzej Bialecki,
>
> > > All hosts are the same. Every one of them.
> > >
> > > If there is no way to split them up, this seems to
> > > imply the distributed nature of nutch is lost on
> > > attempting to build an index for a single large
> > > site. Please correct me if I am wrong with this
> > > presumption.
> >
> > It doesn't matter whether you use a distributed crawl or not - you still
> > are expected to crawl politely, meaning that you should not exceed a
> > certain rate of requests / sec to any given host. Since all your urls
> > come from the same host, then no matter how many machines you throw at
> > it, you will still be crawling at a rate of 1 page / 5 seconds (or
> > whatever you set in nutch-site.xml). So, a single machine can manage
> > this just fine.
>
> Currently, I have 4 machines running nutch, one master/slave,
> and 3 pure slaves. What is the best procedure for turning off
> the 3 slaves?
>
> Should I go back to a "local" setup only, without the overhead
> of hadoop dfs?
>
> What is the best recommendation?
>
> Thanks!
>
> JohnM
>
> --
> john mendenhall
> [EMAIL PROTECTED]
> surf utopia
> internet services

--
http://sids.in
"If you are not having fun, you are not doing it right."
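[Editor's note: for reference, a minimal sketch of the local-mode override Siddhartha describes. The property names ('fs.default.name', 'mapred.job.tracker') and values ('file:///', 'local') are taken from the message above; placing them in conf/hadoop-site.xml (or the equivalent Nutch config override file for your version) is an assumption about the particular setup.]

    <?xml version="1.0"?>
    <!-- Local-mode override: run Nutch without HDFS and without a JobTracker. -->
    <configuration>

      <!-- Use the local filesystem instead of HDFS. -->
      <property>
        <name>fs.default.name</name>
        <value>file:///</value>
      </property>

      <!-- Run MapReduce jobs in-process ("local" runner), so no tasktracker slaves are needed. -->
      <property>
        <name>mapred.job.tracker</name>
        <value>local</value>
      </property>

    </configuration>

[With this in place, conf/slaves would list only the local machine (or be emptied), since no slave daemons are started in local mode.]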
