Hi all, I read the "Anatomy of a MapReduce Job Run with Hadoop" tutorial by Tom White (http://answers.oreilly.com/topic/459-anatomy-of-a-mapreduce-job-run-with-hadoop/), and have a few questions about how the input splits are computed, stored, and retrieved.
As per the tutorial, the JobClient's submitJob() method computes the input splits and stores them in a directory named after the job ID. Is this directory on the shared filesystem? (Figure 6.1 seems to indicate so.)

1) If yes, what does it mean to "store" the input splits? Does the following happen while the splits are computed: the job's input file is first retrieved from HDFS, the splits are created, and the splits are then stored in a directory (named after the job ID) on HDFS? For example, I have a 128 MB file stored on HDFS with a 64 MB block size, so HDFS stores the file as two 64 MB blocks. However, if I set my mapred.min.split.size to 128 MB and pass this file as input, I see only one map task being launched, and I see an entry like "<task_id> has split on node: <tasktracker_id>" in my jobtracker log. What does this entry mean? HDFS has the file stored as two separate blocks. Are the two 64 MB blocks first retrieved from HDFS and then merged into a single 128 MB file that is stored as a single input split?

2) Also, how does a TaskTracker access a split assigned to it? That is, is there a difference in how a local split (on the same machine) vs. a non-local split (on a machine in the same or a different rack) is retrieved by the TaskTracker? I assume a TaskTracker has to connect to the DataNode hosting the block(s) corresponding to a split. Are different methods invoked for accessing local vs. non-local splits? Is the input split first stored on the TaskTracker's local filesystem, or is it loaded directly into memory as it is retrieved? Does this depend on whether the split is on the same machine as the TaskTracker or not?

Thanks,
Abhishek
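P.S. To make question 1 concrete, here is a small Python sketch of the split-size rule as I understand FileInputFormat to apply it, splitSize = max(minSize, min(maxSize, blockSize)), run against the 128 MB / 64 MB numbers above. The function names and the simplifications (ignoring Hadoop's slack factor on the last split) are mine, not Hadoop's actual code:

```python
def split_size(min_size, max_size, block_size):
    # FileInputFormat's rule, as I understand it:
    #   splitSize = max(minSize, min(maxSize, blockSize))
    return max(min_size, min(max_size, block_size))

def num_splits(file_size, split_size):
    # Ceiling division: each split covers split_size bytes of the file.
    # (Real Hadoop allows the last split to be slightly larger; ignored here.)
    return -(-file_size // split_size)

MB = 1024 * 1024

# Defaults: 64 MB block size, minimal min-split -> one split per block.
s = split_size(1, 2**63 - 1, 64 * MB)
print(num_splits(128 * MB, s))  # prints 2

# With mapred.min.split.size = 128 MB -> a single 128 MB split spanning
# both HDFS blocks (which is what I observe: one map task launched).
s = split_size(128 * MB, 2**63 - 1, 64 * MB)
print(num_splits(128 * MB, s))  # prints 1
```

If this matches what submitJob() actually does, then a stored "split" would just be an (offset, length, host-locations) record over the file, not a merged copy of the blocks; please correct me if that reading is wrong.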
