I have a 512-node cluster. I need 32 of the nodes to do something else. They can remain datanodes, but I cannot run any map or reduce tasks on them. I see three options.
1. Stop the tasktracker on those nodes and leave the datanode running.
2. Set mapred.tasktracker.reduce.tasks.maximum and mapred.tasktracker.map.tasks.maximum to 0 on these nodes and mark them final.
3. Use the parameter mapred.hosts.exclude.

I am assuming that any of the three methods would work. To start with, I went with option 3. I used a local file, /home/hadoop/myjob.exclude, which lists one hostname per line (hadoop-480 .. hadoop-511). But I still see both map and reduce tasks being scheduled to all the 511 nodes.

I understand there is an inherent inefficiency in running only the datanode on these 32 nodes.

Here are my questions.

1. Will all three methods work?
2. If I choose method 3, does this file exist as a DFS file or a regular file? If it is a regular file, does it need to exist on all the nodes or only on the node where the job is submitted?

Many thanks in advance,
Raj
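
P.S. For reference, here is a minimal sketch of how an exclude file in the layout described above (one hostname per line) could be generated. It assumes the hostnames are literally hadoop-480 through hadoop-511; the function names exclude_hostnames and write_exclude_file are made up for illustration and are not part of Hadoop.

```python
from pathlib import Path


def exclude_hostnames(first=480, last=511, prefix="hadoop-"):
    """Return the hostnames to exclude, inclusive of both ends.

    The defaults reflect the range mentioned in the post (an assumption
    about the naming scheme, not something Hadoop requires).
    """
    return [f"{prefix}{n}" for n in range(first, last + 1)]


def write_exclude_file(path, hostnames):
    """Write one hostname per line, ending with a trailing newline."""
    Path(path).write_text("\n".join(hostnames) + "\n")


# Example usage (path taken from the post):
#   write_exclude_file("/home/hadoop/myjob.exclude", exclude_hostnames())
```

Generating the file this way only guarantees that the list is complete and consistently formatted; it says nothing about where the file has to live or when it is read, which is the actual question above.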
