On Sun, Apr 5, 2009 at 1:14 AM, Foss User <[email protected]> wrote:
> I am trying to learn Hadoop, and a lot of questions come to my mind
> when I try to learn it. So I will be asking a few questions here from
> time to time until I feel completely comfortable with it. Here are
> some questions now:
>
> 1. Is it true that Hadoop should be installed in the same location on
> all Linux machines? From what I have understood, installing it in the
> same location on all nodes is necessary only if I am going to use
> bin/start-dfs.sh and bin/start-mapred.sh to start the data nodes and
> task trackers on all slaves. Otherwise, it is not required. How
> correct am I?
That's correct. To use those scripts, the "hadoop" script needs to be in
the same location on every machine. The different machines could, however,
have different hadoop-site.xml files pointing dfs.name.dir at different
locations. That makes management a bit trickier, but it is useful if you
have different disk setups on different machines (a quick sketch of such
an override is at the end of this message).

> 2. Say a slave goes down (due to network problems or a power cut) while
> a word count job was going on. When it comes up again, what are the
> tasks I need to do? Are bin/hadoop-daemon.sh start datanode and
> bin/hadoop-daemon.sh start tasktracker enough for recovery? Do I
> have to delete any /tmp/hadoop-hadoop directories before starting? Is
> it guaranteed that, on starting, any corrupt files in the tmp directory
> would be discarded and everything restored to normalcy?

Yes - just starting the daemons should be enough. They'll clean up their
temporary files on their own.

> 3. Say I have 1 master and 4 slaves, and I start a datanode on two
> slaves and a tasktracker on the other two. I put files into HDFS, which
> means the files would be stored on the first two datanodes. Then I run
> a word count job, which means the word count tasks would run on the two
> tasktrackers. How would the two tasktrackers now get the files to do
> the word counting? In the documentation I read that tasks are run on
> the nodes which have the data, but in this setup the datanodes and
> tasktrackers are separate. So how will the word count job do its work?

Hadoop will *try* to schedule tasks with data locality in mind, but if
that's impossible, it will read data off of remote nodes. Even when a task
runs data-local, it uses the same TCP-based protocol to get data from the
datanode (this is something that is currently being worked on). Data
locality is an optimization to avoid network IO, not a requirement.

FYI, you shouldn't run with fewer than 3 datanodes under the default
configuration. This may be the source of some of the problems in the other
messages you've sent recently. The default value for dfs.replication in
hadoop-default.xml is 3, meaning that HDFS will try to place each block on
3 machines. If only 2 datanodes are up, all of your blocks will by
definition be under-replicated, and your cluster will be somewhat grumpy
(see the second sketch at the end of this message).

-Todd
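
To make the per-machine hadoop-site.xml point above concrete, here is a
minimal sketch of what such an override might look like on one machine.
The path is a made-up placeholder, not something from this thread:

<?xml version="1.0"?>
<!-- hadoop-site.xml on one particular machine (sketch; the path below is
     hypothetical).  Each machine can carry its own copy of this file,
     pointing dfs.name.dir at whatever disk layout that machine has. -->
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/data/disk1/dfs/name</value>
  </property>
</configuration>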

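And for the replication point, a sketch of a hadoop-site.xml override for
a small test cluster. Lowering dfs.replication is my own suggestion here,
not something discussed above; the alternative is simply to run 3 or more
datanodes:

<?xml version="1.0"?>
<!-- hadoop-site.xml sketch: hadoop-default.xml ships dfs.replication = 3,
     so with only 2 datanodes every block stays under-replicated.  Setting
     the value to 2 (an assumption for a 2-datanode test setup) or running
     3+ datanodes keeps the cluster happy. -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>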