> Now, in case this input file is not split based on HDFS block but
> one split per file, I will consequently have only 1 mapper, since I have
> only 1 input split. Where does the computation of the mapper take place:
> on machineA, machineB, machineC, or on another machine inside the cluster?
> Or is it not possible to predict the behavior of the system?
Does the file still take up 3 data blocks? If so, I'd say the solution is to
write the data with a larger block size (block size is a per-file setting).
That way, the whole file will be on a single node and you'll get locality for
all of the data. If the split does cover multiple blocks, the input format
would need to suggest running on a host that contains the most blocks of the
file. I'm not sure if the base FileInputFormat does this. My guess is it
would provide a hint based only on the first block in the file.

-Joey

On Thu, Jun 23, 2011 at 10:12 AM, Hassen Riahi <hassen.ri...@cern.ch> wrote:
> Hi,
>
>> Hi,
>>
>>> By the way, how will the split be done? I mean, will the input be split
>>> by HDFS block? Will I have 1 map task per HDFS block?
>>
>> The default behavior is to split the file based on the HDFS block
>> size, but this depends on the InputFormat, and you can also write your
>> own InputFormat to create a split of the size/nature that you want.
>> There are already many InputFormats that other people have written,
>> too; please have a look. Examples include splits of N lines,
>> one split per file, and so on.
>>
>> Yes, the default behavior is to have one mapper per input split, but
>> again, this can be overridden by a custom InputFormat, for example if
>> you ask the InputFormat not to split a file and the file is bigger
>> than the block size.
>>
>>> Will this workflow benefit from Hadoop's data locality optimization?
>>
>> I did not understand this question.
>
> Sorry, I was not clear enough. Let's say that I have 1 file stored in
> HDFS and that it is split into 3 HDFS blocks. Let's say that these HDFS
> blocks, blockA, blockB, and blockC, reside on machineA, machineB, and
> machineC, respectively.
>
> On the other side, let's say also that this file is the input file and
> that it is split based on HDFS block, so I will have one mapper per input
> split (consequently, 3 mappers: mapperA, mapperB, and mapperC).
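[Editor's note] The default "one mapper per block" arithmetic above can be modeled in a few lines. This is a simplified illustration in plain Java (the class and method names are made up, and it deliberately ignores FileInputFormat's split-slop refinements); it shows why a 3-block file yields 3 mappers, and why rewriting the file with a larger per-file block size yields 1.

```java
// Simplified model of how the default InputFormat behavior derives the
// number of splits (and hence mappers) from file size and HDFS block size.
// Hypothetical names; this is NOT Hadoop's actual implementation.
public class SplitModel {
    // Ceiling division: one split per (possibly partial) block.
    static long numSplits(long fileSize, long blockSize) {
        return (fileSize + blockSize - 1) / blockSize;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // A 200 MB file with a 64 MB block size spans 4 blocks -> 4 mappers.
        System.out.println(numSplits(200 * mb, 64 * mb));
        // The same file written with a 256 MB per-file block size -> 1 mapper,
        // which is Joey's suggestion for getting full locality.
        System.out.println(numSplits(200 * mb, 256 * mb));
    }
}
```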
> If I understand, it is expected that mapperA will be executed on machineA,
> and mapperB on machineB, right? If that is the case, that is what I meant
> by the data locality optimization: the fact that each mapper will be
> executed on the machine where its data resides optimizes the workflow
> execution and the traffic inside the cluster.
>
> Now, in case this input file is not split based on HDFS block but
> one split per file, I will consequently have only 1 mapper, since I have
> only 1 input split. Where does the computation of the mapper take place:
> on machineA, machineB, machineC, or on another machine inside the cluster?
> Or is it not possible to predict the behavior of the system?
>
> Thanks for the help,
> Hassen
>
>> Thanks,
>> -b
>>
>>>> I hope I understood your problem properly, and my suggestion is the
>>>> kind you were looking for.
>>>
>>> Thanks,
>>> Hassen
>>>
>>>> Bibek

--
Joseph Echeverria
Cloudera, Inc.
443.305.9434
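[Editor's note] The locality hint Joey describes for a single whole-file split (prefer the host holding the most blocks) can be sketched without a cluster. The snippet below is a plain-Java illustration with made-up machine names matching the thread's example; a real InputFormat would instead return such hosts from its split's getLocations().

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the locality hint discussed above: given the replica locations
// of each block of a file, pick the host that stores the most blocks as the
// preferred node for a single whole-file split. Illustrative only.
public class LocalityHint {
    static String bestHost(List<List<String>> blockHosts) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> hosts : blockHosts)
            for (String h : hosts)
                counts.merge(h, 1, Integer::sum);
        // Host with the highest block count wins.
        return Collections.max(counts.entrySet(),
                Map.Entry.comparingByValue()).getKey();
    }

    public static void main(String[] args) {
        // blockA replicated on machineA and machineB,
        // blockB on machineB, blockC on machineC (hypothetical placement).
        List<List<String>> placements = Arrays.asList(
                Arrays.asList("machineA", "machineB"),
                Arrays.asList("machineB"),
                Arrays.asList("machineC"));
        // machineB holds 2 of the 3 blocks, so it is the best hint.
        System.out.println(bestHost(placements));
    }
}
```

If no host dominates, any choice still leaves most of the data remote, which is why the thread's other suggestion (a larger per-file block size) is the more robust fix.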