Hi VJ,

> 1) If Hadoop is used, then all the slaves and other machines in the cluster
> need to be formatted to have the HDFS file system. If so, what happens to the
> terabytes of data that need to be crunched? Or is the data on a different
> machine?
You actually assign directories on each machine for Hadoop to use for the
DFS, so the machines can also hold other data; you don't format whole
drives. For example, I have /mnt/disk1/hadoop and /mnt/disk2/hadoop on each
of my DataNodes for HDFS to use (I have pasted a rough example of that
setting at the bottom of this mail). My machines are dedicated to Hadoop,
so they don't store any other data.

> 2) Everywhere it is mentioned that the main advantage of map/reduce and
> Hadoop is that it runs on data that is available locally. So does this mean
> that once the file system is formatted then I have to move my terabytes of
> data and split them across the cluster?

Once you copy data into HDFS you *might* then consider removing it from the
local drives. I think it is more common to dedicate a cluster to Hadoop and
copy data into the DFS from external locations (i.e. the data doesn't also
sit on local drives in the Hadoop cluster). This is how we use it anyway.

When you launch an MR job, Hadoop knows where the chunks of data are
located and runs the processing on the machines in the cluster that hold
those chunks. Remember that HDFS stores redundant copies: you might copy in
a 200 GB file, it gets split into chunks and spread around the cluster, and
each chunk is perhaps saved 3 times. Then, when the job needs to process,
there are 3 machines with any given chunk stored locally, and Hadoop will
try to schedule the tasks needed to complete the job so as to minimise
copying data around, i.e. run them on machines that already have the data
(there is also a small sketch of copying a file in and checking where its
blocks landed at the bottom of this mail).

Since you seem interested in the best setup for MapReduce, you might get
better responses on the mapreduce-user mailing list.

Hope this helps,
Tim

> Thanks
> VJ
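
P.S. For the data directories, this is roughly what the setting looks like
in hdfs-site.xml on each DataNode. The property is dfs.data.dir in my
configuration (check the exact name against your release's docs), and the
paths below are just the ones from my example, so substitute your own disks:

  <property>
    <name>dfs.data.dir</name>
    <value>/mnt/disk1/hadoop,/mnt/disk2/hadoop</value>
    <description>Comma-separated list of local directories the DataNode
    may use for block storage; anything else on those disks is left
    untouched.</description>
  </property>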
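
And a rough, untested sketch of copying a file into HDFS from Java and then
asking the NameNode where the blocks and their replicas ended up - the paths
/data/input/big-file.dat and /user/vj/big-file.dat are only placeholders:

  import java.util.Arrays;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.BlockLocation;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class CopyIntoHdfs {
    public static void main(String[] args) throws Exception {
      // Picks up the cluster settings from the *-site.xml files on the classpath
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);

      Path local = new Path("/data/input/big-file.dat"); // placeholder local file
      Path inDfs = new Path("/user/vj/big-file.dat");    // placeholder HDFS destination

      // HDFS splits the file into blocks, spreads them over the DataNodes and
      // stores each block dfs.replication times (3 by default)
      fs.copyFromLocalFile(local, inDfs);

      // The NameNode knows which hosts hold each block - these are the machines
      // the scheduler will prefer when it runs map tasks over this file
      FileStatus status = fs.getFileStatus(inDfs);
      System.out.println("Replication factor: " + status.getReplication());
      for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
        System.out.println("Block at offset " + block.getOffset()
            + " stored on " + Arrays.toString(block.getHosts()));
      }
      fs.close();
    }
  }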
