Yes, that's right!

Sent from my iPhone
On Oct 25, 2011, at 5:36 PM, <[email protected]> wrote:

> So I guess the job tracker is the one reading the HDFS meta-data and then
> optimizing the scheduling of map jobs based on that?
>
> On 10/25/11 3:13 PM, "Shevek" <[email protected]> wrote:
>
>> We pray to $deity that the mapreduce block size is about the same as (or
>> smaller than) the hdfs block size. We also pray that file format
>> synchronization points are frequent when compared to block boundaries.
>>
>> The JobClient finds the location of each block of each file. It splits the
>> job into FileSplit(s), with one per block.
>>
>> Each FileSplit is processed by a task. The Split contains the locations in
>> which the task should best be run.
>>
>> The last block may be very short. It is then subsumed into the preceding
>> block.
>>
>> Some data is transferred between nodes when the synchronization point for
>> the file format is not at a block boundary. (It basically never is, but we
>> hope it's close, or the purpose of MR locality is defeated.)
>>
>> Specifically to your questions: Most of the data should be read from the
>> local hdfs node under the above assumptions. The communication layer
>> between mapreduce and hdfs is not special.
>>
>> S.
>>
>> On 25 October 2011 11:49, <[email protected]> wrote:
>>
>>> Hello,
>>>
>>> I am trying to understand how data locality works in hadoop.
>>>
>>> If you run a map reduce job do the mappers only read data from the host
>>> on which they are running?
>>>
>>> Is there a communication protocol between the map reduce layer and HDFS
>>> layer so that the mapper gets optimized to read data locally?
>>>
>>> Any pointers on which layer of the stack handles this?
>>>
>>> Cheers,
>>> Ivan
>>>
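[Illustrative note, not part of the original thread: a minimal sketch of the mechanism Shevek describes, using the classic org.apache.hadoop.mapred API. The input path is a placeholder, and the class name is made up for illustration; getSplits() is the step the job client performs to build roughly one split per HDFS block, and getLocations() lists the datanodes the scheduler prefers for that split.]

// SplitLocations.java -- print each input split and its preferred hosts.
// The path "/user/ivan/input" is a placeholder; point it at any HDFS file
// or directory.
import java.util.Arrays;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

public class SplitLocations {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf();
        FileInputFormat.setInputPaths(conf, new Path("/user/ivan/input"));

        // getSplits() is what the job client calls: roughly one FileSplit
        // per HDFS block of each input file.
        TextInputFormat format = new TextInputFormat();
        format.configure(conf);
        InputSplit[] splits = format.getSplits(conf, 1);

        for (InputSplit split : splits) {
            // getLocations() returns the datanodes holding the split's data;
            // the JobTracker tries to run the map task on one of these hosts.
            System.out.println(split + " -> hosts: "
                    + Arrays.toString(split.getLocations()));
        }
    }
}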
