Andrzej Bialecki,

> > All hosts are the same. Every one of them.
> >
> > If there is no way to split them up, this seems to imply the
> > distributed nature of nutch is lost on attempting to build an index
> > for a single large site. Please correct me if I am wrong with this
> > presumption.
>
> It doesn't matter whether you use a distributed crawl or not - you
> still are expected to crawl politely, meaning that you should not
> exceed a certain rate of requests/sec to any given host. Since all
> your urls come from the same host, then no matter how many machines
> you throw at it, you will still be crawling at a rate of 1 page / 5
> seconds (or whatever you set in nutch-site.xml). So, a single machine
> can manage this just fine.
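Just to make sure I understand the knob you are referring to: I believe
the per-host politeness delay is the fetcher.server.delay property in
nutch-site.xml (property name from memory, so please correct me if it
differs in your version). A minimal sketch of the override would be
something like:

    <?xml version="1.0"?>
    <configuration>
      <!-- Assumed property name: seconds to wait between successive
           requests to the same host. The shipped default of 5.0 is
           what gives the "1 page / 5 seconds" rate mentioned above. -->
      <property>
        <name>fetcher.server.delay</name>
        <value>5.0</value>
      </property>
    </configuration>

So with all urls on one host, the fetch rate is bounded by this delay no
matter how many fetcher machines participate.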
Currently, I have 4 machines running nutch: one master/slave, and 3 pure
slaves. What is the best procedure for turning off the 3 slaves? Should
I go back to a "local" setup only, without the overhead of hadoop dfs?
What is the best recommendation?

Thanks!

JohnM

--
john mendenhall
[EMAIL PROTECTED]
surf utopia
internet services
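P.S. By a "local" setup I mean dropping dfs and distributed map-reduce
entirely and running everything in-process on the master. My guess is
that amounts to something like the following in conf/hadoop-site.xml,
though I am not certain these are the exact property names for the
Hadoop version bundled with my Nutch:

    <?xml version="1.0"?>
    <configuration>
      <!-- Assumed settings for standalone operation: use the local
           filesystem instead of dfs, and run jobs without a separate
           jobtracker. -->
      <property>
        <name>fs.default.name</name>
        <value>local</value>
      </property>
      <property>
        <name>mapred.job.tracker</name>
        <value>local</value>
      </property>
    </configuration>

If that is roughly right, is there anything else to clean up when
retiring the 3 slaves?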
