Doug Cutting wrote:
> Where are you setting mapred.map.tasks? It should be set in the
> mapred-default.xml used when starting the tasktracker. You must restart
> the task trackers for this setting to take effect:
>
>   bin/nutch-daemons.sh stop tasktracker
>   bin/nutch-daemons.sh start tasktracker

The settings are done in mapred-default.xml. I have tried the restart, but
I think stop-all.sh & start-all.sh does this as well.
> (I assume your ~/.slaves file contains srv34 and srv21.)
Yes, correct
> Note that nutch-daemons.sh assumes that /etc/ssh/sshd_config has
> 'AcceptEnv yes'. It's also useful, if your hosts don't share the Nutch
> installation via NFS, to set NUTCH_MASTER with something like:
>
>   export NUTCH_MASTER=srv21:/home/USER/src/nutch

AcceptEnv yes and AcceptEnv * are both rejected (Bad configuration option:
AcceptEnv), but I have set PermitUserEnvironment yes now; see the sketch
below.

Good option, let's try to get it working first :)
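On the AcceptEnv error: as far as I can tell, AcceptEnv only appeared in
OpenSSH 3.9, which would explain the "Bad configuration option" message on
an older sshd, and it takes variable name patterns rather than yes/no (so
'AcceptEnv yes' would only accept a variable literally named "yes").
Untested here, but the working form is probably something like:

  # /etc/ssh/sshd_config on each slave (OpenSSH 3.9 or newer)
  AcceptEnv NUTCH_*

  # plus SendEnv NUTCH_* in the client's ssh_config on the master,
  # so the variables are actually forwarded

  # fallback on older OpenSSH: in /etc/ssh/sshd_config
  PermitUserEnvironment yes
  # and then put NUTCH_MASTER=... in ~/.ssh/environment on each slave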
> I'm working on documenting this stuff better...

It would be nice if there were some good documentation, but we will find
the problem.
I start the crawler with "bin/nutch crawl seeds -depth 10"; is this OK?
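(For completeness: I believe the crawl command also accepts -dir, -threads
and -topN; the values here are only examples:)

  bin/nutch crawl seeds -dir crawl -threads 4 -depth 10 -topN 1000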
OK, now the crawler starts, but it gets stuck after 25%; I think this is
another problem. But OK, it also only runs on the primary server.
The firewall is fully open, so that can't be the problem.
On the master (srv21 - conf/nutch-site.xml):
<name>fs.default.name</name>
<value>localhost:9009</value>
<name>mapred.job.tracker</name>
<value>localhost:9010</value>
On the master (srv21 - /root/.slaves)
srv21
srv34
On the slave (srv34 - conf/nutch-site.xml):
<name>fs.default.name</name>
<value>srv21:9009</value>
<name>mapred.job.tracker</name>
<value>srv21:9010</value>
On the master & slave (srv21/srv34 - conf/mapred-default.xml)
<name>mapred.map.tasks</name>
<value>4</value>
<name>mapred.reduce.tasks</name>
<value>2</value>
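(These entries are abbreviated above; in the actual files each name/value
pair sits inside a <property> element under the top-level root, which I
believe is <nutch-conf> in this version. For example, mapred-default.xml:)

  <?xml version="1.0"?>
  <nutch-conf>
    <property>
      <name>mapred.map.tasks</name>
      <value>4</value>
    </property>
    <property>
      <name>mapred.reduce.tasks</name>
      <value>2</value>
    </property>
  </nutch-conf>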
Is there a way to say: crawl the URLs I submit to the server, and once all
URLs are done, update them again, but not more than once every x days?
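I guess the step-by-step tools could do this instead of the one-shot crawl
command? A rough sketch, assuming the crawl data lives under 'crawl' and
that I understand the db.default.fetch.interval setting (days between
re-fetches) correctly:

  bin/nutch inject crawl/crawldb seeds             # add newly submitted urls
  bin/nutch generate crawl/crawldb crawl/segments  # selects only urls that are due
  s=`ls -d crawl/segments/* | tail -1`             # newest segment
  bin/nutch fetch $s
  bin/nutch updatedb crawl/crawldb $s

With db.default.fetch.interval set to x in nutch-site.xml, generate should
not select a URL again until x days have passed since its last fetch.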