But I guess it isn't always possible to achieve optimal scheduling, right? What happens then? Is network topology taken into account, perhaps?
On 26.10.2011, at 4:42, Mapred Learn <[email protected]> wrote:

> Yes, that's right!
>
> Sent from my iPhone
>
> On Oct 25, 2011, at 5:36 PM, <[email protected]> wrote:
>
>> So I guess the job tracker is the one reading the HDFS metadata and then
>> optimizing the scheduling of map tasks based on that?
>>
>> On 10/25/11 3:13 PM, "Shevek" <[email protected]> wrote:
>>
>>> We pray to $deity that the mapreduce split size is about the same as (or
>>> smaller than) the HDFS block size. We also pray that file format
>>> synchronization points are frequent compared to block boundaries.
>>>
>>> The JobClient finds the location of each block of each file. It splits the
>>> job into FileSplits, one per block.
>>>
>>> Each FileSplit is processed by a task. The split contains the locations on
>>> which the task would best be run.
>>>
>>> The last block may be very short. It is then subsumed into the preceding
>>> block.
>>>
>>> Some data is transferred between nodes when the synchronization point for
>>> the file format is not at a block boundary. (It basically never is, but we
>>> hope it's close, or the purpose of MR locality is defeated.)
>>>
>>> Specifically to your questions: most of the data should be read from the
>>> local HDFS node under the above assumptions. The communication layer
>>> between mapreduce and HDFS is not special.
>>>
>>> S.
>>>
>>> On 25 October 2011 11:49, <[email protected]> wrote:
>>>
>>>> Hello,
>>>>
>>>> I am trying to understand how data locality works in Hadoop.
>>>>
>>>> If you run a map reduce job, do the mappers only read data from the host
>>>> on which they are running?
>>>>
>>>> Is there a communication protocol between the map reduce layer and the
>>>> HDFS layer so that the mapper is optimized to read data locally?
>>>>
>>>> Any pointers on which layer of the stack handles this?
>>>>
>>>> Cheers,
>>>> Ivan
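For anyone following along: the split logic Shevek describes (one split per block, with a short trailing block folded into the preceding split) can be sketched roughly like this. This is a hedged, self-contained approximation, not the actual Hadoop source; `SPLIT_SLOP`, the `computeSplits` helper, and the `{offset, length}` pair representation are my own simplifications of what FileInputFormat does, and real splits also carry the preferred host locations obtained from the NameNode.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of how a file is carved into FileSplits, one per block,
// with a short trailing remainder merged into the final split.
public class SplitSketch {
    // If the tail is within 10% of a full split, keep it in one split
    // rather than scheduling a tiny extra task (assumed constant).
    static final double SPLIT_SLOP = 1.1;

    // Returns {offset, length} pairs covering the whole file.
    static List<long[]> computeSplits(long fileLength, long splitSize) {
        List<long[]> splits = new ArrayList<>();
        long remaining = fileLength;
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            splits.add(new long[] { fileLength - remaining, splitSize });
            remaining -= splitSize;
        }
        if (remaining > 0) {
            // The short last block is subsumed into the final split.
            splits.add(new long[] { fileLength - remaining, remaining });
        }
        return splits;
    }

    public static void main(String[] args) {
        // A 260 MB file with 128 MB blocks yields two splits:
        // the 4 MB tail rides along with the second split.
        for (long[] s : computeSplits(260L << 20, 128L << 20)) {
            System.out.println("offset=" + s[0] + " length=" + s[1]);
        }
    }
}
```

In the real framework each split would additionally record the DataNode hosts holding that block, which is what the JobTracker uses to place the task near its data.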
