Missed link: [1] -
http://wiki.apache.org/hadoop/FAQ#If_a_block_size_of_64MB_is_used_and_a_file_is_written_that_uses_less_than_64MB.2C_will_64MB_of_disk_space_be_consumed.3F

On Wed, Jul 20, 2011 at 3:37 PM, Harsh J <ha...@cloudera.com> wrote:
> Florin,
>
> On Wed, Jul 20, 2011 at 2:03 PM, Florin P <florinp...@yahoo.com> wrote:
>> Hello, Harsh!
>>  Thank you for your quick response. I have another questions:
>> 1. You are saying that each map task will take one file as its
>> input, but when the file sizes are less than the block size, is it
>> possible for a map task to take more than one file?
>
> If you have a 2 MB file on DFS with a configured block size of 256
> MB, the file still takes up only 2 MB on disk. See [1]. The block
> size is merely a splitting boundary, not a unit that gets filled
> up. No two files can reside in the same 'block'.
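>
> To see this concretely, here is a minimal sketch using the plain
> FileSystem API (the class name and argument handling are mine, not
> from this thread):
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.ContentSummary;
> import org.apache.hadoop.fs.FileStatus;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
>
> public class BlockUsageCheck {
>   public static void main(String[] args) throws Exception {
>     FileSystem fs = FileSystem.get(new Configuration());
>     Path file = new Path(args[0]); // e.g. a 2 MB file on DFS
>     FileStatus status = fs.getFileStatus(file);
>     ContentSummary summary = fs.getContentSummary(file);
>     // The block size is only an upper bound per block, not a
>     // pre-allocated unit.
>     System.out.println("block size : " + status.getBlockSize());
>     System.out.println("file length: " + status.getLen());
>     // Space consumed is length x replication, not blockSize x
>     // replication.
>     System.out.println("space used : " + summary.getSpaceConsumed());
>   }
> }
>
> For a 2 MB file, 'space used' comes out near 2 MB times the
> replication factor, no matter how large the block size is.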
>
>> 2. In this particular case, will the same behavior apply (meaning
>> each file will be processed to its end, and then the next one)?
>
> Unless you pack more blocks into each split with an input format
> like CombineFileInputFormat, this does not happen.
>
> But if you do use CombineFileInputFormat, then yes, it happens like
> that.
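>
> As a sketch of wiring that up (assuming a Hadoop 2-style job setup
> and the stock CombineTextInputFormat, which may not exist on older
> releases; there you would subclass CombineFileInputFormat yourself):
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.io.LongWritable;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapreduce.Job;
> import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
> import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
> import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
>
> public class CombineSmallFilesJob {
>   public static void main(String[] args) throws Exception {
>     Job job = Job.getInstance(new Configuration(), "combine-small-files");
>     job.setJarByClass(CombineSmallFilesJob.class);
>     // One split may now span several small files, so a single map
>     // task reads records from more than one file, one after another.
>     job.setInputFormatClass(CombineTextInputFormat.class);
>     CombineTextInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);
>     FileInputFormat.addInputPath(job, new Path(args[0]));
>     FileOutputFormat.setOutputPath(job, new Path(args[1]));
>     // Default (identity) mapper emits (offset, line) pairs as-is.
>     job.setOutputKeyClass(LongWritable.class);
>     job.setOutputValueClass(Text.class);
>     System.exit(job.waitForCompletion(true) ? 0 : 1);
>   }
> }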
>
> Of course, you can also write your own custom InputFormat+RecordReader
> that can mix files' records as you want it to (the mapjoin example
> reads off multiple files at a time, for example, to join).
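>
> As a rough illustration of that record-mixing idea, a hypothetical
> round-robin reader over the files packed into one CombineFileSplit
> (the class name and the interleaving policy are mine; a
> CombineFileInputFormat subclass would return this from its
> createRecordReader()):
>
> import java.io.IOException;
> import java.util.ArrayList;
> import java.util.List;
> import org.apache.hadoop.io.LongWritable;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapreduce.InputSplit;
> import org.apache.hadoop.mapreduce.RecordReader;
> import org.apache.hadoop.mapreduce.TaskAttemptContext;
> import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
> import org.apache.hadoop.mapreduce.lib.input.FileSplit;
> import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;
>
> public class InterleavingReader extends RecordReader<LongWritable, Text> {
>   private final List<LineRecordReader> readers =
>       new ArrayList<LineRecordReader>();
>   private int next = 0;  // reader to try next (round-robin)
>   private int last = -1; // reader that produced the current record
>
>   @Override
>   public void initialize(InputSplit split, TaskAttemptContext context)
>       throws IOException, InterruptedException {
>     // Open one line reader per file chunk packed into this split.
>     CombineFileSplit cs = (CombineFileSplit) split;
>     for (int i = 0; i < cs.getNumPaths(); i++) {
>       LineRecordReader r = new LineRecordReader();
>       r.initialize(new FileSplit(cs.getPath(i), cs.getOffset(i),
>           cs.getLength(i), null), context);
>       readers.add(r);
>     }
>   }
>
>   @Override
>   public boolean nextKeyValue() throws IOException, InterruptedException {
>     // Visit files round-robin; drained files simply report false.
>     for (int tries = 0; tries < readers.size(); tries++) {
>       int i = next;
>       next = (next + 1) % readers.size();
>       if (readers.get(i).nextKeyValue()) {
>         last = i;
>         return true;
>       }
>     }
>     return false; // every file is drained
>   }
>
>   @Override
>   public LongWritable getCurrentKey() throws IOException, InterruptedException {
>     return readers.get(last).getCurrentKey();
>   }
>
>   @Override
>   public Text getCurrentValue() throws IOException, InterruptedException {
>     return readers.get(last).getCurrentValue();
>   }
>
>   @Override
>   public float getProgress() throws IOException, InterruptedException {
>     // Average the per-file progress values.
>     float sum = 0;
>     for (LineRecordReader r : readers) sum += r.getProgress();
>     return readers.isEmpty() ? 1.0f : sum / readers.size();
>   }
>
>   @Override
>   public void close() throws IOException {
>     for (LineRecordReader r : readers) r.close();
>   }
> }
>
> Round-robin is just one policy; anything that remembers which
> underlying reader produced the last record would work the same way.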
>
>> --- On Wed, 7/20/11, Harsh J <ha...@cloudera.com> wrote:
>>
>>> From: Harsh J <ha...@cloudera.com>
>>> Subject: Re: Order of files in Map class
>>> To: hdfs-user@hadoop.apache.org
>>> Date: Wednesday, July 20, 2011, 3:44 AM
>>> Florin,
>>>
>>> Your second example is how it happens in Hadoop, but there's more
>>> here to understand.
>>>
>>> To start with, your InputFormat (input splitter) computes and
>>> publishes a set of InputSplits. The total number of input splits
>>> is going to be your total number of 'Map Tasks' in Hadoop as the
>>> job proceeds. The input splits are generally block splits, i.e.,
>>> start-and-stop offsets within a single file.
>>>
>>> Each 'MapTask' is assigned one split from this list of splits. So
>>> every map task initializes separately, in its own JVM (no shared
>>> resources -- again, it's a different mapper instance per file or
>>> block!) and reads its input split alone, feeding records into its
>>> map(key, value, context) function.
>>>
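>>> To make that concrete, here is a hypothetical sketch (the class
>>> name is mine) of a mapper that logs, in setup(), the single
>>> FileSplit its task was assigned, assuming a plain FileInputFormat:
>>>
>>> import java.io.IOException;
>>> import org.apache.hadoop.io.LongWritable;
>>> import org.apache.hadoop.io.Text;
>>> import org.apache.hadoop.mapreduce.Mapper;
>>> import org.apache.hadoop.mapreduce.lib.input.FileSplit;
>>>
>>> public class SplitLoggingMapper
>>>     extends Mapper<LongWritable, Text, Text, LongWritable> {
>>>   @Override
>>>   protected void setup(Context context) {
>>>     // This task's one-and-only split: a single path plus a start
>>>     // offset and length within that file.
>>>     FileSplit split = (FileSplit) context.getInputSplit();
>>>     System.err.println("Task reads " + split.getPath()
>>>         + " from offset " + split.getStart()
>>>         + " for " + split.getLength() + " bytes");
>>>   }
>>>
>>>   @Override
>>>   protected void map(LongWritable key, Text value, Context context)
>>>       throws IOException, InterruptedException {
>>>     // Every (key, value) this task sees comes from that one split.
>>>     context.write(value, key);
>>>   }
>>> }
>>>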
>>> So to summarize, your second example is what will happen, but it
>>> would happen in parallel instead, such as:
>>>
>>> map1  | map2  | …
>>> file1 | file2 | …
>>> row1  | row1  | …
>>> row2  | row2  | …
>>>
>>> P.s. What I've explained here is the default behavior. Of course,
>>> things can be tweaked heavily to achieve other behaviors, like
>>> your first example, but those probably come with greater read
>>> costs attached. The 'hadoop' way is data-local and
>>> one-file-per-task.
>>>
>>> On Wed, Jul 20, 2011 at 12:11 PM, Florin P <florinp...@yahoo.com>
>>> wrote:
>>> > Hello!
>>> >  Suppose that we have the files F1, F2,..Fk given by
>>> the input splitter to the map class, what is the order in
>>> which they will arrive when map function  is applied?
>>> >  What is interesting me  if  it is possible that in
>>> the map function to arrive mixed key-value pairs from
>>> different files? They keys will arrive related with their
>>> file, till no more keys are left from source file or they
>>> can arrive one key from F1 one key from Fk and so on.
>>> >  Example:
>>> >   Mixed key-value pairs at the map function:
>>> >    K1 from F1
>>> >    K5 from F5
>>> >    K7 from F8
>>> >    etc.
>>> >
>>> >   Ordered key-value pairs:
>>> >    K1 from F1
>>> >    ..
>>> >    K_end_F1 from F1
>>> >    K5 from F5
>>> >    ..
>>> >    K_end_F5 from F5
>>> >   and so on.
>>> >
>>> > I look forward to your answer.
>>> >  Regards,
>>> >  Florin
>>> >
>>> >
>>>
>>>
>>>
>>> --
>>> Harsh J
>>>
>>
>
>
>
> --
> Harsh J
>



-- 
Harsh J
