Hi,

By the way, how will the split be done? I mean, will the input be split by HDFS
block? Will I have one map task per HDFS block?

The default behavior is to split the file based on the HDFS block
size, but this depends on the InputFormat, and you can also write your
own InputFormat to create splits of the size/nature that you want.
Many InputFormats have already been written by others, so please have
a look; examples include splits of N lines (NLineInputFormat), one
split per file, and so on.
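
A non-splitting InputFormat is only a few lines. Here is a minimal
sketch against the org.apache.hadoop.mapreduce API (the class name
WholeFileTextInputFormat is just an example, not a stock Hadoop class):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    // One split per file: returning false from isSplitable() makes
    // FileInputFormat hand each whole file to a single mapper,
    // regardless of how many blocks the file occupies.
    public class WholeFileTextInputFormat extends TextInputFormat {
        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            return false;
        }
    }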

Yes, the default behavior is to have one mapper per input split, but
again, this can be overridden by a custom InputFormat: for example, if
you ask the InputFormat not to split a file that is bigger than the
block size, you still get a single mapper for the whole file.
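
To wire such an InputFormat into a job, a sketch assuming the Hadoop
2.x Job API (the job name and input path here are made up):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "whole-file-example");
    // The InputFormat decides the splits, and hence the number of mappers.
    job.setInputFormatClass(WholeFileTextInputFormat.class);
    FileInputFormat.addInputPath(job, new Path("/user/hassen/input"));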

Will this workflow benefit from Hadoop's data locality optimization?


I did not understand this question.

Sorry, I was not clear enough... let's say that I have one file stored in HDFS and that it is split into 3 HDFS blocks. Let's say that these blocks, blockA, blockB and blockC, reside respectively on machineA, machineB and machineC.

On the other side, let's say also that this file is the job's input and that it is split based on HDFS blocks, so I will have one mapper per input split (consequently, 3 mappers: mapperA, mapperB and mapperC).

If I understand correctly, it is expected that mapperA will be executed on machineA, mapperB on machineB, and so on... right? If that is the case, that is what I meant by the data locality optimization: the fact that each mapper is executed on the machine where its data resides optimizes the workflow execution and the traffic inside the cluster...

Now, in the case where the input file is not split based on HDFS blocks but as one split per file, I will consequently have only 1 mapper, since I have only 1 input split. Where does the computation of that mapper take place? On machineA, machineB, machineC, or on another machine inside the cluster? Or is it not possible to predict the behavior of the system?
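
One way to see the locality hints a split carries is to ask the
InputFormat for its splits and print InputSplit.getLocations(); the
scheduler treats those host names as preferences rather than
guarantees. A sketch, reusing the hypothetical WholeFileTextInputFormat
and input path from above:

    import java.util.Arrays;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    Job job = Job.getInstance(new Configuration(), "split-locations");
    FileInputFormat.addInputPath(job, new Path("/user/hassen/input"));
    // getSplits() is what the framework itself calls when planning mappers.
    for (InputSplit split : new WholeFileTextInputFormat().getSplits(job)) {
        // The hosts where this split's data resides; the scheduler tries,
        // but is not obliged, to run the mapper on one of them.
        System.out.println(Arrays.toString(split.getLocations()));
    }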

Thanks for the help,
Hassen


Thanks,
-b


I hope I understood your problem properly and that my suggestion is
the kind you were looking for.

Thanks,
Hassen


Bibek


