On Sat, Apr 4, 2009 at 3:47 AM, Foss User <[email protected]> wrote:

> Certain things are not clear. I am asking them point-wise. I have a
> setup of 4 linux machines. 1 name node, 1 job tracker and 2 slaves
> (each is data node as well as task tracker).
For a cluster of this size, you probably want to run one machine as both
the NN and JT, and the other three as slaves. There's no problem
colocating multiple daemons on the same box as long as it isn't
overloaded, and for a small cluster like this it should be fine.

> 1. Should I edit conf/slaves on all nodes or only on name node? Do I
> have to edit this in job tracker too?

The conf/slaves file is only used by the start/stop scripts (e.g.
start-all.sh). Those scripts are just handy wrappers that ssh to each of
the slaves to start the datanode/tasktracker daemons on those machines.
So, you should edit conf/slaves on whichever machine you tend to run
those administrative scripts from, but they exist for convenience only
and are not necessary. You can start the datanode/tasktracker services
on the slave nodes manually and it will work just the same.

> 2. What does the 'bin/hadoop namenode -format' actually do? I want to
> know in the OS level. Does it create some temporary folders in all the
> slave-data-nodes which will be collectively interpreted as HDFS by the
> Hadoop framework?

namenode -format is run on the namenode machine and sets up the on-disk
database/storage for the filesystem metadata in dfs.name.dir. The
datanodes maintain their storage automatically and don't need any
particular "format" command to be run -- simply list a directory in
dfs.data.dir in hadoop-site.xml, and the datanode will start using it
for block storage.

> 3. Does the 'bin/hadoop namenode -format' command affect name node,
> job tracker and task tracker nodes (assuming there is a slave which is
> only a task tracker and not a data node)?

See above -- it only affects the metadata store on the namenode. The
jobtracker and tasktrackers are unaffected, and technically the
datanodes are unaffected as well. Datanodes will "find out" about the
formatting when they report block locations for files that the namenode
no longer knows about.

> 4.
> If I add one more slave (datanode + task tracker) later to the
> cluster, what are the changes I need to make apart from adding the IP
> address of the slave node to conf/slaves? Do I need to restart any
> service?

You simply need to start the DN/TT on the new node. Adding it to
conf/slaves only affects the start/stop scripts. The DN and TT will
contact the NN and JT respectively and register themselves in the
system.

> 5. When I add a new slave to the cluster later, do I need to run the
> namenode -format command again? If I have to, how do I ensure that
> existing data is not lost? If I don't have to, how will the folders
> necessary for HDFS be created in the new slave machine?

No -- after you start the daemons on the new slave, the NN and JT will
start assigning blocks/tasks to it immediately. The HDFS directories
will be created when you start up the datanode -- you just need to
ensure that the directory configured in dfs.data.dir exists and is
writable by the hadoop user.

Hope that helps
-Todd
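P.S. In case it helps, here's a minimal sketch of the relevant
hadoop-site.xml properties mentioned above. The paths are just
examples -- point them at real local directories owned by the hadoop
user:

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Where the namenode keeps filesystem metadata; this is the
       directory that 'bin/hadoop namenode -format' initializes.
       Only matters on the namenode machine. -->
  <property>
    <name>dfs.name.dir</name>
    <value>/var/hadoop/dfs/name</value>
  </property>
  <!-- Where each datanode stores HDFS blocks. No format step needed;
       the datanode starts using it for block storage on startup.
       The directory must exist and be writable by the hadoop user. -->
  <property>
    <name>dfs.data.dir</name>
    <value>/var/hadoop/dfs/data</value>
  </property>
</configuration>
```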
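And for question 4, the "start the DN/TT manually" step looks like this
on the new slave (run from the Hadoop install directory; no restart of
the NN or JT is needed):

```
# On the new slave node, from the Hadoop install directory.
# The daemons register themselves with the NN/JT on startup.
bin/hadoop-daemon.sh start datanode
bin/hadoop-daemon.sh start tasktracker
```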
