On 25 October 2011 17:36, <[email protected]> wrote:

> So I guess the job tracker is the one reading the HDFS meta-data and then
> optimizing the scheduling of map jobs based on that?
Currently no, it's the JobClient that does it. Although you use the term
"scheduling" in a way which may confuse some other participants in the
thread, since your original question wasn't about what Hadoop people
usually call the "scheduler".

S.

> On 10/25/11 3:13 PM, "Shevek" <[email protected]> wrote:
>
>> We pray to $deity that the mapreduce block size is about the same as (or
>> smaller than) the hdfs block size. We also pray that file format
>> synchronization points are frequent when compared to block boundaries.
>>
>> The JobClient finds the location of each block of each file. It splits
>> the job into FileSplit(s), with one per block.
>>
>> Each FileSplit is processed by a task. The Split contains the locations
>> in which the task should best be run.
>>
>> The last block may be very short. It is then subsumed into the preceding
>> block.
>>
>> Some data is transferred between nodes when the synchronization point
>> for the file format is not at a block boundary. (It basically never is,
>> but we hope it's close, or the purpose of MR locality is defeated.)
>>
>> Specifically to your questions: most of the data should be read from the
>> local hdfs node under the above assumptions. The communication layer
>> between mapreduce and hdfs is not special.
>>
>> S.
>>
>> On 25 October 2011 11:49, <[email protected]> wrote:
>>
>>> Hello,
>>>
>>> I am trying to understand how data locality works in hadoop.
>>>
>>> If you run a map reduce job do the mappers only read data from the host
>>> on which they are running?
>>>
>>> Is there a communication protocol between the map reduce layer and HDFS
>>> layer so that the mapper gets optimized to read data locally?
>>>
>>> Any pointers on which layer of the stack handles this?
>>>
>>> Cheers,
>>> Ivan
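For anyone following along, the split computation described in the quoted
message can be sketched roughly like this. This is an illustrative Python
sketch, not Hadoop's actual Java code; the names (Block, Split,
compute_splits) and the 10% tail threshold are assumptions for the example,
while real Hadoop implements this in FileInputFormat.getSplits using block
locations from the NameNode.

```python
# Illustrative sketch of JobClient-style split computation: one split per
# HDFS block, a very short trailing block subsumed into the preceding
# split, and each split carrying the hosts that hold a replica of its
# block (so the scheduler can prefer running the map task there).
# Names and the tail-merge threshold are hypothetical, not Hadoop's API.
from dataclasses import dataclass
from typing import List

@dataclass
class Block:
    offset: int
    length: int
    hosts: List[str]  # datanodes holding a replica of this block

@dataclass
class Split:
    offset: int
    length: int
    hosts: List[str]  # preferred nodes on which to run the map task

def compute_splits(blocks: List[Block], tail_ratio: float = 0.1) -> List[Split]:
    splits = [Split(b.offset, b.length, b.hosts) for b in blocks]
    # Subsume a very short last block into the preceding split.
    if len(splits) >= 2 and splits[-1].length < tail_ratio * splits[-2].length:
        tail = splits.pop()
        splits[-1].length += tail.length
    return splits

MiB = 1024 * 1024
blocks = [
    Block(0, 64 * MiB, ["node1", "node2"]),
    Block(64 * MiB, 64 * MiB, ["node2", "node3"]),
    Block(128 * MiB, 1024, ["node1", "node3"]),  # tiny tail block
]
splits = compute_splits(blocks)
print(len(splits))        # 2: the tail block was merged into the second split
print(splits[-1].length)  # 67109888 (64 MiB + 1 KiB)
```

Each resulting split's host list is only a locality *hint*: a map task can
still run on any node and read the block remotely over HDFS, which is why
most, but not all, reads end up local.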
