So I guess the JobTracker is the one reading the HDFS metadata and then
optimizing the scheduling of map tasks based on that?


On 10/25/11 3:13 PM, "Shevek" <[email protected]> wrote:

>We pray to $deity that the mapreduce split size is about the same as (or
>smaller than) the hdfs block size. We also pray that file format
>synchronization points are frequent when compared to block boundaries.
>
>The JobClient finds the location of each block of each file. It splits the
>job into FileSplit(s), with one per block.
>
>Each FileSplit is processed by a task. The Split contains the locations in
>which the task should best be run.
>
>The last block may be very short, in which case it is subsumed into the
>preceding split.
>
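A rough sketch of that split computation, in Python rather than Hadoop's Java so it runs standalone. Block, Split and compute_splits are made-up names, not Hadoop API, and the 1.1 slop factor is how I remember FileInputFormat treating a short tail, so treat it as an assumption.

```python
from dataclasses import dataclass
from typing import List

# Assumed tolerance: a tail shorter than 10% of a split is merged
# into the preceding split instead of becoming its own split.
SPLIT_SLOP = 1.1


@dataclass
class Block:
    """Hypothetical stand-in for one HDFS block and its replica hosts."""
    offset: int
    length: int
    hosts: List[str]


@dataclass
class Split:
    """Hypothetical stand-in for a FileSplit: byte range plus preferred hosts."""
    start: int
    length: int
    hosts: List[str]


def compute_splits(file_length: int, split_size: int, blocks: List[Block]) -> List[Split]:
    """Cut a file into splits and tag each with the hosts of the block it starts in."""

    def hosts_at(offset: int) -> List[str]:
        for block in blocks:
            if block.offset <= offset < block.offset + block.length:
                return block.hosts
        raise ValueError(f"offset {offset} is not covered by any block")

    splits = []
    remaining = file_length
    # Emit full-size splits while more than SPLIT_SLOP splits' worth remains.
    while remaining / split_size > SPLIT_SLOP:
        start = file_length - remaining
        splits.append(Split(start, split_size, hosts_at(start)))
        remaining -= split_size
    # Whatever is left (possibly slightly more than one split) is the final split.
    if remaining > 0:
        start = file_length - remaining
        splits.append(Split(start, remaining, hosts_at(start)))
    return splits
```

With three 100-byte blocks followed by a 5-byte tail and a split size of 100, this yields three splits, the last one 105 bytes long and tagged with the hosts of the third block.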
>Some data is transferred between nodes when the synchronization point for
>the file format is not at a block boundary. (It basically never is, but we
>hope it's close, or the purpose of MR locality is defeated.)
>
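To make the boundary issue concrete, here is a minimal sketch for a newline-delimited format, where the synchronization point is simply the next newline. read_split is a hypothetical helper, not Hadoop's record reader; the rule it assumes is that a record belongs to the split in which its first byte lies.

```python
import io
from typing import List


def read_split(data: bytes, start: int, length: int) -> List[bytes]:
    """Return the newline-delimited records owned by the split [start, start + length)."""
    end = start + length
    stream = io.BytesIO(data)
    if start > 0:
        # Back up one byte and discard through the next newline. If the byte
        # before `start` is itself a newline, only that byte is discarded and
        # the record beginning exactly at `start` is kept.
        stream.seek(start - 1)
        stream.readline()
    records = []
    # Read every record that begins before `end`, even if it finishes past it.
    # In HDFS terms, the bytes past `end` may live in a block on another node.
    while stream.tell() < end:
        line = stream.readline()
        if not line:
            break
        records.append(line.rstrip(b"\n"))
    return records
```

For data = b"alpha\nbravo\ncharlie\ndelta\n" cut into splits of 10 bytes, the first split reads "bravo" to its end even though its last two bytes fall in the second split, and the second split skips that partial record. Every record is read exactly once, and the only cross-boundary traffic is the tail of one record per split.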
>Specifically to your questions: Most of the data should be read from the
>local hdfs node under the above assumptions. The communication layer
>between mapreduce and hdfs is not special.
>
>S.
>
>On 25 October 2011 11:49, <[email protected]> wrote:
>
>> Hello,
>>
>> I am trying to understand how data locality works in hadoop.
>>
>> If you run a map reduce job do the mappers only read data from the host
>> on which they are running?
>>
>> Is there a communication protocol between the map reduce layer and HDFS
>> layer so that the mapper gets optimized to read data locally?
>>
>> Any pointers on which layer of the stack handles this?
>>
>> Cheers,
>> Ivan
>>
