> Now, in case this input file is not split based on HDFS block but
> one split per file, I will consequently have only 1 mapper, since I have
> only 1 input split. Where does the computation of the mapper take place:
> on machineA, machineB, machineC, or on another machine inside the cluster?
> Or is it not possible to predict the behavior of the system?
Does the file still take up 3 data blocks? If so, I'd say the solution is to
write the data with a larger block size (block size is a per-file setting).
That way, the whole file will be on a single node and you'll get locality for
all of the data. If the split does cover multiple blocks, the input format
would need to suggest running on a host that contains the most blocks of the
file. I'm not sure if the base FileInputFormat does this. My guess is it
would provide a hint based only on the first block in the file.

-Joey

On Thu, Jun 23, 2011 at 10:12 AM, Hassen Riahi <hassen.ri...@cern.ch> wrote:
> Hi,
>
>> Hi,
>>
>>> By the way, how will the split be done? I mean, will the input be split
>>> by HDFS block? Will I have 1 map task per HDFS block?
>>
>> The default behavior is to split the file based on the HDFS block
>> size, but this depends on the InputFormat, and you can also write your
>> own InputFormat to create a split of the size/nature that you want.
>> There are already many InputFormats that other people have written,
>> too; please have a look. Examples include splits of N lines,
>> one split per file, and so on.
>>
>> Yes, the default behavior is to have one mapper per input split, but
>> again, this can be overridden by a custom InputFormat, for example if
>> you ask the InputFormat not to split a file and the file is bigger
>> than the block size.
>>
>>> Will this workflow benefit from Hadoop's data locality optimization?
>>
>> I did not understand this question.
>
> Sorry, I was not clear enough. Let's say that I have 1 file stored in
> HDFS and that it is split into 3 HDFS blocks. Let's say that these HDFS
> blocks, blockA, blockB, and blockC, reside on machineA, machineB, and
> machineC, respectively.
>
> On the other side, let's say also that this file is the input file and
> that it is split based on HDFS block, so I will have one mapper per input
> split (consequently, 3 mappers: mapperA, mapperB, and mapperC).
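[Editor's note] The default "one mapper per block" arithmetic above can be modeled in a few lines. This is a simplified illustration in plain Java (the class and method names are made up, and it deliberately ignores FileInputFormat's split-slop refinements); it shows why a 3-block file yields 3 mappers, and why rewriting the file with a larger per-file block size yields 1.

```java
// Simplified model of how the default InputFormat behavior derives the
// number of splits (and hence mappers) from file size and HDFS block size.
// Hypothetical names; this is NOT Hadoop's actual implementation.
public class SplitModel {
    // Ceiling division: one split per (possibly partial) block.
    static long numSplits(long fileSize, long blockSize) {
        return (fileSize + blockSize - 1) / blockSize;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // A 200 MB file with a 64 MB block size spans 4 blocks -> 4 mappers.
        System.out.println(numSplits(200 * mb, 64 * mb));
        // The same file written with a 256 MB per-file block size -> 1 mapper,
        // which is Joey's suggestion for getting full locality.
        System.out.println(numSplits(200 * mb, 256 * mb));
    }
}
```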
> If I understand, it is expected that mapperA will be executed on machineA,
> and mapperB on machineB, right? If that is the case, that is what I meant
> by the data locality optimization: the fact that each mapper will be
> executed on the machine where its data resides optimizes the workflow
> execution and the traffic inside the cluster.
>
> Now, in case this input file is not split based on HDFS block but
> one split per file, I will consequently have only 1 mapper, since I have
> only 1 input split. Where does the computation of the mapper take place:
> on machineA, machineB, machineC, or on another machine inside the cluster?
> Or is it not possible to predict the behavior of the system?
>
> Thanks for the help,
> Hassen
>
>> Thanks,
>> -b
>>
>>>> I hope I understood your problem properly, and my suggestion is the
>>>> kind you were looking for.
>>>
>>> Thanks,
>>> Hassen
>>>
>>>> Bibek

--
Joseph Echeverria
Cloudera, Inc.
443.305.9434
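[Editor's note] The locality hint Joey describes for a single whole-file split (prefer the host holding the most blocks) can be sketched without a cluster. The snippet below is a plain-Java illustration with made-up machine names matching the thread's example; a real InputFormat would instead return such hosts from its split's getLocations().

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the locality hint discussed above: given the replica locations
// of each block of a file, pick the host that stores the most blocks as the
// preferred node for a single whole-file split. Illustrative only.
public class LocalityHint {
    static String bestHost(List<List<String>> blockHosts) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> hosts : blockHosts)
            for (String h : hosts)
                counts.merge(h, 1, Integer::sum);
        // Host with the highest block count wins.
        return Collections.max(counts.entrySet(),
                Map.Entry.comparingByValue()).getKey();
    }

    public static void main(String[] args) {
        // blockA replicated on machineA and machineB,
        // blockB on machineB, blockC on machineC (hypothetical placement).
        List<List<String>> placements = Arrays.asList(
                Arrays.asList("machineA", "machineB"),
                Arrays.asList("machineB"),
                Arrays.asList("machineC"));
        // machineB holds 2 of the 3 blocks, so it is the best hint.
        System.out.println(bestHost(placements));
    }
}
```

If no host dominates, any choice still leaves most of the data remote, which is why the thread's other suggestion (a larger per-file block size) is the more robust fix.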