Hi,
By the way, how will the split be done? I mean, will the input be
split by HDFS block? Will I have 1 map task per HDFS block?
The default behavior is to split the file based on the HDFS block
size, but this depends on the InputFormat, and you can also write your
own InputFormat to create splits of the size/nature that you want.
Many InputFormats have already been written by other people too;
please have a look. Examples include splits of N lines
(NLineInputFormat), one split per file, and so on.
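For example, a minimal sketch of plugging in the ready-made
NLineInputFormat (assuming the Hadoop 2 "mapreduce" API; the class
name NLineSplitJob and the 1000-line figure are just placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

public class NLineSplitJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "n-line-split-example");
        job.setJarByClass(NLineSplitJob.class);

        // Split the input into chunks of 1000 lines instead of HDFS
        // blocks; each chunk becomes one InputSplit, and therefore
        // one map task.
        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.setNumLinesPerSplit(job, 1000);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        // ... set the mapper/reducer classes and the output path as
        // usual, then submit the job ...
    }
}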
Yes, the default behavior is to have one mapper per input split, but
again, this can be overridden by a custom InputFormat. For example, if
you ask the InputFormat not to split a file that is bigger than the
block size, the whole file still becomes a single split, and therefore
a single mapper.
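To make that concrete, here is a rough sketch of such an InputFormat
(again assuming the Hadoop 2 "mapreduce" API; WholeFileTextInputFormat
is just a name I made up):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Behaves like TextInputFormat, except that isSplitable() always
// returns false, so getSplits() produces exactly one split per input
// file, no matter how many HDFS blocks the file spans.
public class WholeFileTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }
}

Calling job.setInputFormatClass(WholeFileTextInputFormat.class) would
then give you exactly one mapper for the whole file.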
Will this workflow benefit
from Hadoop's data locality optimization?
I did not understand this question.
Sorry, I was not clear enough... Let's say that I have 1 file stored
in HDFS and that it is split into 3 HDFS blocks. Let's say that these
HDFS blocks, blockA, blockB and blockC, reside respectively on
machineA, machineB and machineC.
On the other side, let's say that this file is the input file and that
it is split based on HDFS blocks, so I will have one mapper per input
split (consequently, 3 mappers: mapperA, mapperB and mapperC).
If I understand correctly, it is expected that mapperA will be
executed on machineA, and mapperB on machineB... right? If that is the
case, this is what I meant by the data locality optimization: the fact
that each mapper is executed on the machine where its data resides
optimizes the workflow execution and reduces the traffic inside the
cluster...
Now, in the case where this input file is not split based on HDFS
blocks but handled as one split per file, I will consequently have
only 1 mapper, since I have only 1 input split. Where does the
computation of that mapper take place? On machineA, machineB,
machineC, or on another machine inside the cluster? Or is it not
possible to predict the behavior of the system?
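In case it helps to make the question concrete, here is a small
sketch of how I imagine one could check where each split would prefer
to run, by printing the split's preferred hosts (assuming the Hadoop 2
"mapreduce" API; PrintSplitLocations is just a name I made up, and
args[0] is the input path):

import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class PrintSplitLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "print-split-locations");
        FileInputFormat.addInputPath(job, new Path(args[0]));

        // Ask the InputFormat for the splits it would hand to the
        // framework for this input.
        TextInputFormat inputFormat = new TextInputFormat();
        List<InputSplit> splits = inputFormat.getSplits(job);

        // Each split carries a list of hosts holding its data.
        for (InputSplit split : splits) {
            System.out.println(split + " -> hosts: "
                    + String.join(", ", split.getLocations()));
        }
    }
}

I am assuming here that the hosts returned by getLocations() are the
placement hints the scheduler uses when it tries to run each map task
close to its data.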
Thanks for the help,
Hassen
I hope I understood your problem properly and that my suggestion is
the kind you were looking for.
Thanks,
Bibek