On Sat, Apr 4, 2009 at 3:47 AM, Foss User <[email protected]> wrote:

> Certain things are not clear. I am asking them point-wise. I have a
> setup of 4 linux machines. 1 name node, 1 job tracker and 2 slaves
> (each is data node as well as task tracker).
For a cluster of this size, you probably want to run one machine as both
the NN and JT, and the other three as slaves. There's no problem
colocating multiple daemons on the same box as long as it isn't
overloaded, and for a small cluster like this it should be fine.

> 1. Should I edit conf/slaves on all nodes or only on name node? Do I
> have to edit this in job tracker too?

The conf/slaves file is only used by the start/stop scripts (e.g.
start-all.sh). Those scripts are just handy wrappers that ssh to each of
the slaves to start the datanode/tasktracker daemons on those machines.
So, you should edit conf/slaves on whichever machine you tend to run
those administrative scripts from, but they exist for convenience only
and are not necessary. You can start the datanode/tasktracker services
on the slave nodes manually and it will work just the same.

> 2. What does the 'bin/hadoop namenode -format' actually do? I want to
> know in the OS level. Does it create some temporary folders in all the
> slave-data-nodes which will be collectively interpreted as HDFS by the
> Hadoop framework?

namenode -format is run on the namenode machine and sets up the on-disk
database/storage for the filesystem metadata in dfs.name.dir. The
datanodes maintain their storage automatically and don't need any
particular "format" command to be run -- simply list a directory in
dfs.data.dir in hadoop-site.xml, and the datanode will start using it
for block storage.

> 3. Does the 'bin/hadoop namenode -format' command affect name node,
> job tracker and task tracker nodes (assuming there is a slave which is
> only a task tracker and not a data node)?

See above -- it only affects the metadata store on the namenode. The
jobtracker and tasktrackers are unaffected, and technically the
datanodes are unaffected as well. Datanodes will "find out" about the
formatting when they report block locations for files that the namenode
no longer knows about.

> 4.
> If I add one more slave (datanode + task tracker) later to the
> cluster, what are the changes I need to make apart from adding the IP
> address of the slave node to conf/slaves? Do I need to restart any
> service?

You simply need to start the DN/TT on the new node. Adding it to
conf/slaves only affects the start/stop scripts. The DN and TT will
contact the NN and JT respectively and register themselves in the
system.

> 5. When I add a new slave to the cluster later, do I need to run the
> namenode -format command again? If I have to, how do I ensure that
> existing data is not lost? If I don't have to, how will the folders
> necessary for HDFS be created in the new slave machine?

No -- after you start the daemons on the new slave, the NN and JT will
start assigning blocks/tasks to it immediately. The HDFS directories
will be created when you start up the datanode -- you just need to
ensure that the directory configured in dfs.data.dir exists and is
writable by the hadoop user.

Hope that helps
-Todd
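P.S. In case it helps, here's a minimal sketch of the relevant
hadoop-site.xml properties mentioned above. The paths are just
examples -- point them at real local directories owned by the hadoop
user:

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Where the namenode keeps filesystem metadata; this is the
       directory that 'bin/hadoop namenode -format' initializes.
       Only matters on the namenode machine. -->
  <property>
    <name>dfs.name.dir</name>
    <value>/var/hadoop/dfs/name</value>
  </property>
  <!-- Where each datanode stores HDFS blocks. No format step needed;
       the datanode starts using it for block storage on startup.
       The directory must exist and be writable by the hadoop user. -->
  <property>
    <name>dfs.data.dir</name>
    <value>/var/hadoop/dfs/data</value>
  </property>
</configuration>
```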
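And for question 4, the "start the DN/TT manually" step looks like this
on the new slave (run from the Hadoop install directory; no restart of
the NN or JT is needed):

```
# On the new slave node, from the Hadoop install directory.
# The daemons register themselves with the NN/JT on startup.
bin/hadoop-daemon.sh start datanode
bin/hadoop-daemon.sh start tasktracker
```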
