The RecordReader code is executed on the node where the map task runs. The framework tries to run each map on a node that holds the split's data, but there is no guarantee that a map will run only on such a node. If a split spans multiple blocks, the framework attempts to choose a node that holds the most data for that split (summed across those blocks) to run the map.
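
To make the multi-block case concrete, here is a rough sketch in plain Java of the idea of "the node that contains the maximum data in the split". It is not Hadoop's actual code: the SplitLocality class, the Block type, the bestHost method and the node names are all made up for illustration, and the tie-breaking rule is an assumption.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class SplitLocality {

    /** Hypothetical stand-in for a block's location info: byte range plus replica hosts. */
    static final class Block {
        final long offset;
        final long length;
        final String[] hosts;

        Block(long offset, long length, String... hosts) {
            this.offset = offset;
            this.length = length;
            this.hosts = hosts;
        }
    }

    /**
     * Returns the host that stores the most bytes of the split
     * [start, start + length), or null if no block overlaps the split.
     * Ties go to the alphabetically first host (an arbitrary choice made here).
     */
    static String bestHost(long start, long length, List<Block> blocks) {
        long end = start + length;
        Map<String, Long> bytesPerHost = new TreeMap<>();
        for (Block b : blocks) {
            // Bytes of this block that fall inside the split.
            long overlap = Math.min(end, b.offset + b.length) - Math.max(start, b.offset);
            if (overlap <= 0) {
                continue;
            }
            // Every replica host of the block holds those bytes locally.
            for (String host : b.hosts) {
                bytesPerHost.merge(host, overlap, Long::sum);
            }
        }
        String best = null;
        long bestBytes = 0;
        for (Map.Entry<String, Long> e : bytesPerHost.entrySet()) {
            if (e.getValue() > bestBytes) {
                best = e.getKey();
                bestBytes = e.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Three 64-byte blocks, two replicas each (toy sizes).
        List<Block> blocks = Arrays.asList(
                new Block(0, 64, "nodeA", "nodeB"),
                new Block(64, 64, "nodeB", "nodeC"),
                new Block(128, 64, "nodeC", "nodeD"));
        // The split covers bytes 16..144, so it touches all three blocks:
        // nodeA = 48, nodeB = 48 + 64 = 112, nodeC = 64 + 16 = 80, nodeD = 16.
        System.out.println(bestHost(16, 128, blocks)); // prints nodeB
    }
}
```

Even when such a preferred node exists, the map may still be scheduled elsewhere, for example if that node has no free slot; the preference is only a hint.
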
Jothi

On 2/6/09 2:54 AM, "Saptarshi Guha" <[email protected]> wrote:

> Hello All,
> In order to get a better understanding of Hadoop, i've started reading
> the source and have a question
> The FileInputFormat, reads in files, splits into splitsizes (which may
> be bigger than block size) and creates FileSplits.
> The FileSplits contain the start, length *and* the locations of the split.
> The LineRecordReader, receives a split and emits records.
>
> So far I think i'm correct(hopefully). Now, my questions
> Does the LineRecordReader run on a machine, in some sense, closest to
> the location of the splits? i.e
> Q1: If the split is less than the block size, then the split is
> located on one machine (apart from replicates): does the
> LineRecordReader run on the machine which contains the split? Or at
> least attempt to?
> Q2. If a split is greater than the block size, it spans multiple
> blocks which could reside on more than 1 machine. In this case, on
> which machine does the LineRecordReader run? The machine 'closest' to
> them?
>
> Please correct me if i'm wrong.
> Thank you
> Saptarshi
