Hi,
By the way, how will the split be done? I mean, will the input be
split by HDFS block? Will I have 1 map task per HDFS block?
The default behavior is to split the file based on the HDFS block
size, but this depends on the InputFormat, and you can also write your
own InputFormat to create splits of the size/nature that you want.
Many InputFormats have already been written by other people too;
please have a look. Examples include splits of N lines
(NLineInputFormat), one split per file, and so on.
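For example, a minimal sketch of plugging in the ready-made
NLineInputFormat (assuming the Hadoop 2 "mapreduce" API; the class
name NLineSplitJob and the 1000-line figure are just placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

public class NLineSplitJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "n-line-split-example");
        job.setJarByClass(NLineSplitJob.class);

        // Split the input into chunks of 1000 lines instead of HDFS
        // blocks; each chunk becomes one InputSplit, and therefore
        // one map task.
        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.setNumLinesPerSplit(job, 1000);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        // ... set the mapper/reducer classes and the output path as
        // usual, then submit the job ...
    }
}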
Yes, the default behavior is to have one mapper per input split, but
again, this can be overridden by a custom InputFormat. For example, if
you ask the InputFormat not to split a file that is bigger than the
block size, the whole file still becomes a single split, and therefore
a single mapper.
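To make that concrete, here is a rough sketch of such an InputFormat
(again assuming the Hadoop 2 "mapreduce" API; WholeFileTextInputFormat
is just a name I made up):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Behaves like TextInputFormat, except that isSplitable() always
// returns false, so getSplits() produces exactly one split per input
// file, no matter how many HDFS blocks the file spans.
public class WholeFileTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }
}

Calling job.setInputFormatClass(WholeFileTextInputFormat.class) would
then give you exactly one mapper for the whole file.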
Will this workflow benefit
from Hadoop's data locality optimization?
I did not understand this question.
Sorry, I was not clear enough... Let's say that I have 1 file stored
in HDFS and that it is split into 3 HDFS blocks. Let's say that these
HDFS blocks, blockA, blockB and blockC, reside respectively on
machineA, machineB and machineC.
On the other side, let's say that this file is the input file and that
it is split based on HDFS blocks, so I will have one mapper per input
split (consequently, 3 mappers: mapperA, mapperB and mapperC).
If I understand correctly, it is expected that mapperA will be
executed on machineA, and mapperB on machineB... right? If that is the
case, this is what I meant by the data locality optimization: the fact
that each mapper is executed on the machine where its data resides
optimizes the workflow execution and reduces the traffic inside the
cluster...
Now, in the case where this input file is not split based on HDFS
blocks but handled as one split per file, I will consequently have
only 1 mapper, since I have only 1 input split. Where does the
computation of that mapper take place? On machineA, machineB,
machineC, or on another machine inside the cluster? Or is it not
possible to predict the behavior of the system?
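In case it helps to make the question concrete, here is a small
sketch of how I imagine one could check where each split would prefer
to run, by printing the split's preferred hosts (assuming the Hadoop 2
"mapreduce" API; PrintSplitLocations is just a name I made up, and
args[0] is the input path):

import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class PrintSplitLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "print-split-locations");
        FileInputFormat.addInputPath(job, new Path(args[0]));

        // Ask the InputFormat for the splits it would hand to the
        // framework for this input.
        TextInputFormat inputFormat = new TextInputFormat();
        List<InputSplit> splits = inputFormat.getSplits(job);

        // Each split carries a list of hosts holding its data.
        for (InputSplit split : splits) {
            System.out.println(split + " -> hosts: "
                    + String.join(", ", split.getLocations()));
        }
    }
}

I am assuming here that the hosts returned by getLocations() are the
placement hints the scheduler uses when it tries to run each map task
close to its data.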
Thanks for the help,
Hassen
I hope I understood your problem properly and that my suggestion is
the kind you were looking for.
Thanks,
Bibek