On 25 October 2011 17:36, <[email protected]> wrote:

> So I guess the job tracker is the one reading the HDFS meta-data and then
> optimizing the scheduling of map jobs based on that?
>

Currently, no: it's the JobClient that reads the HDFS metadata and computes
the splits. One caution on terminology: you're using "scheduling" in a way
that may confuse other participants in the thread, since your original
question wasn't about what Hadoop people usually call the "scheduler" (the
pluggable component that assigns queued tasks to TaskTrackers).
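
For the curious, the metadata lookup is just the public FileSystem API. A
minimal sketch (not the actual JobClient code) that prints where each block
of a file lives:

    import java.util.Arrays;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockLocationsDemo {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus stat = fs.getFileStatus(new Path(args[0]));
        // One BlockLocation per block; each lists the hosts holding a replica.
        for (BlockLocation b : fs.getFileBlockLocations(stat, 0, stat.getLen())) {
          System.out.println(b.getOffset() + "+" + b.getLength()
              + " -> " + Arrays.toString(b.getHosts()));
        }
      }
    }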

S.


> On 10/25/11 3:13 PM, "Shevek" <[email protected]> wrote:
>
> >We pray to $deity that the MapReduce split size is about the same as (or
> >smaller than) the HDFS block size. We also pray that file-format
> >synchronization points are frequent compared to block boundaries.
> >
> >The JobClient finds the location of each block of each input file and
> >splits the job into FileSplits, one per block.
> >
> >Each FileSplit is processed by one map task. The split carries the list
> >of hosts on which the task would best be run.
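> >
> >(For illustration only, that shape in code; names from memory:)
> >
> >    // Given a block's offset, length, and replica hosts (e.g. from
> >    // FileSystem#getFileBlockLocations), build the split:
> >    FileSplit split = new FileSplit(path, blockOffset, blockLength, hosts);
> >    String[] preferred = split.getLocations(); // hosts to try to run on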
> >
> >The last block of a file may be very short. In that case it is subsumed
> >into the preceding split rather than becoming a tiny task of its own.
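> >
> >(From memory, FileInputFormat.getSplits does roughly the following; the
> >real code uses a slop factor of 1.1 blocks:)
> >
> >    final double SPLIT_SLOP = 1.1;  // allow the last split to overrun a bit
> >    long bytesRemaining = fileLength;
> >    // Cut full-block splits while more than 1.1 blocks remain...
> >    while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
> >      splits.add(new FileSplit(path, fileLength - bytesRemaining,
> >                               splitSize, hosts));
> >      bytesRemaining -= splitSize;
> >    }
> >    // ...so the final split may be up to 1.1 blocks long: a very short
> >    // last block is folded into its predecessor instead of becoming a
> >    // tiny task of its own.
> >    if (bytesRemaining != 0) {
> >      splits.add(new FileSplit(path, fileLength - bytesRemaining,
> >                               bytesRemaining, hosts));
> >    }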
> >
> >Some data is transferred between nodes whenever the file format's
> >synchronization point does not fall exactly on a block boundary. (It
> >basically never does, but we hope it's close, or the purpose of MapReduce
> >locality is defeated.)
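> >
> >(The record-reader convention that makes this safe, sketched; the helper
> >names here are made up:)
> >
> >    if (start != 0) {
> >      // Not the first split: the previous reader owns the record that
> >      // straddles our start, so seek forward to the next sync point.
> >      start = skipToNextSyncPoint(in, start);
> >    }
> >    while (pos <= end) {
> >      // Read whole records; the last one may run past 'end' into the
> >      // next block. That overshoot is the remote read mentioned above.
> >      pos = readOneRecord(in, pos);
> >    }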
> >
> >Specifically to your questions: under the above assumptions, most of the
> >data should be read from the local HDFS node. The communication layer
> >between MapReduce and HDFS is not special: a mapper reads through the
> >ordinary HDFS client, and locality comes purely from where the task was
> >placed.
> >
> >S.
> >
> >On 25 October 2011 11:49, <[email protected]> wrote:
> >
> >> Hello,
> >>
> >> I am trying to understand how data locality works in Hadoop.
> >>
> >> If you run a MapReduce job, do the mappers only read data from the host
> >> on which they are running?
> >>
> >> Is there a communication protocol between the MapReduce layer and the
> >> HDFS layer so that the mapper is optimized to read data locally?
> >>
> >> Any pointers on which layer of the stack handles this?
> >>
> >> Cheers,
> >> Ivan
> >>
>
>
