Florin,

Your second example is how it happens in Hadoop, but there's more here
to understand.

To start with, your InputFormat (the input splitter) computes and
publishes a list of InputSplits. The total number of input splits
becomes your total number of 'Map Tasks' in Hadoop as the job
proceeds. The input splits are generally block splits, i.e.,
start-offset-and-length ranges over a single file.
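
If it helps to see it, here's a rough sketch (my own toy snippet, not
from the Hadoop docs -- it assumes the new mapreduce API and takes the
input path as args[0]) that just asks TextInputFormat for its splits.
The size of the returned list is exactly the map task count described
above:

import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class SplitInspector {
  public static void main(String[] args) throws Exception {
    // Toy inspection job: never submitted, we only ask the
    // InputFormat what splits it would hand to the framework.
    Job job = new Job(new Configuration());
    FileInputFormat.addInputPath(job, new Path(args[0]));

    List<InputSplit> splits = new TextInputFormat().getSplits(job);
    // The framework launches one map task per entry in this list.
    System.out.println("map tasks = " + splits.size());
    for (InputSplit split : splits) {
      // For file inputs, each split prints as path:start+length,
      // i.e., a byte range over a single file.
      System.out.println(split);
    }
  }
}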

Each 'MapTask' is assigned one split from this list of splits. So
every map task initializes separately, in its own JVM (no shared
resources -- again, it's a different mapper instance per file or
block!) and reads its input split alone, feeding records into its
map(key, value, context) function.
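
To make that concrete, a bare-bones mapper might look like the below
(the class name and output types are placeholders I picked, nothing
standard). Each task instance of this class is fed records from
exactly one split, so its map() never sees pairs from two files
interleaved:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class OneSplitMapper
    extends Mapper<LongWritable, Text, Text, LongWritable> {
  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // With TextInputFormat, 'key' is the byte offset of this line
    // within the single file (split) this task was assigned, and
    // 'value' is the line itself. Records from other files go to
    // other task instances, in other JVMs.
    context.write(value, key);
  }
}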

So to summarize, your second example is what will happen, but in
parallel across tasks, like so:

map1  | map2  | …
file1 | file2 | …
row1  | row1  | …
row2  | row2  | …

P.S. What I've explained here is the default behavior. Of course,
things can be heavily tweaked to achieve other patterns, like your
first example, but those probably come with greater read costs
attached. The 'Hadoop' way is data-local, one-split-per-task.

On Wed, Jul 20, 2011 at 12:11 PM, Florin P <florinp...@yahoo.com> wrote:
> Hello!
>  Suppose that we have the files F1, F2, .., Fk given by the input splitter to
> the map class. In what order will they arrive when the map function
> is applied?
>  What interests me is whether it is possible for mixed key-value pairs from
> different files to arrive at the map function. Will the keys arrive grouped
> by their file, until no more keys are left from the source file, or can one
> key arrive from F1, one key from Fk, and so on?
>  Example:
>   Mixed key-value pairs at the map function:
>    K1 from F1
>    K5 from F5
>    K7 from F8
>  etc.
>
>  Ordered key-value pairs:
>    K1 from F1
>    ..
>    K_end_F1 from F1
>    K5 from F5
>    ..
>    K_end_F5 from F5
>  and so on.
>
> I look forward to your answer.
>  Regards,
>  Florin



-- 
Harsh J
