Missed link: [1] - http://wiki.apache.org/hadoop/FAQ#If_a_block_size_of_64MB_is_used_and_a_file_is_written_that_uses_less_than_64MB.2C_will_64MB_of_disk_space_be_consumed.3F
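
If you want to check this for yourself, here is a quick sketch against
the plain FileSystem API (untested; the class name is mine). A 2 MB
file will report a 2 MB length regardless of the configured block size:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockVsLength {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FileStatus st = fs.getFileStatus(new Path(args[0]));
    // The block size caps each block; it is not an allocation unit.
    System.out.println("length     = " + st.getLen() + " bytes");
    System.out.println("block size = " + st.getBlockSize() + " bytes");
  }
}

Run it with the file's path as its only argument.
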
On Wed, Jul 20, 2011 at 3:37 PM, Harsh J <ha...@cloudera.com> wrote:
> Florin,
>
> On Wed, Jul 20, 2011 at 2:03 PM, Florin P <florinp...@yahoo.com> wrote:
>> Hello, Harsh!
>> Thank you for your quick response. I have another question:
>> 1. You are saying that each map task will take one file as its input,
>> but when the file sizes are less than the block size, is it possible
>> for a map task to take more than one file?
>
> If you have a 2 MB file on DFS with a configured block size of 256 MB,
> the file still takes up only 2 MB. See [1]. The block size is a mere
> splitting enforcer, not a fill-up thing. No two files can reside on
> the same 'block'.
>
>> 2. In this particular case, will the same behavior happen (meaning
>> each file will be processed to its end, and then the next one)?
>
> Unless you pack more blocks per split with an input format like
> CombineFileInputFormat, this does not happen.
>
> But if you do use CombineFileInputFormat, then yes, it does happen
> like that.
>
> Of course, you can also write your own custom InputFormat+RecordReader
> that can mix files' records as you want it to (the mapjoin example
> reads off multiple files at a time, for example, to join).
>
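> For reference, a rough sketch of that CombineFileInputFormat pattern
> for text files -- untested, and the class names are mine: subclass
> the (abstract) input format and hand each file chunk packed into a
> combined split to a plain LineRecordReader:
>
> import java.io.IOException;
> import org.apache.hadoop.io.LongWritable;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapreduce.InputSplit;
> import org.apache.hadoop.mapreduce.RecordReader;
> import org.apache.hadoop.mapreduce.TaskAttemptContext;
> import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
> import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
> import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
> import org.apache.hadoop.mapreduce.lib.input.FileSplit;
> import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;
>
> public class CombinedTextInputFormat
>     extends CombineFileInputFormat<LongWritable, Text> {
>
>   public CombinedTextInputFormat() {
>     // Cap combined splits at ~128 MB; otherwise all blocks on a node
>     // may collapse into a single split.
>     setMaxSplitSize(128 * 1024 * 1024);
>   }
>
>   @Override
>   public RecordReader<LongWritable, Text> createRecordReader(
>       InputSplit split, TaskAttemptContext context) throws IOException {
>     // One reader per combined split; it walks the packed chunks in
>     // order, so files are still read one after another.
>     return new CombineFileRecordReader<LongWritable, Text>(
>         (CombineFileSplit) split, context, ChunkReader.class);
>   }
>
>   // Reads the idx'th file chunk of a combined split by delegating to
>   // a regular LineRecordReader. The 3-arg constructor signature is
>   // what CombineFileRecordReader instantiates reflectively.
>   public static class ChunkReader extends RecordReader<LongWritable, Text> {
>     private final LineRecordReader delegate = new LineRecordReader();
>     private final CombineFileSplit split;
>     private final int idx;
>
>     public ChunkReader(CombineFileSplit split, TaskAttemptContext ctx,
>         Integer idx) {
>       this.split = split;
>       this.idx = idx;
>     }
>
>     @Override
>     public void initialize(InputSplit ignored, TaskAttemptContext ctx)
>         throws IOException, InterruptedException {
>       // Point the delegate at this chunk's slice of its file.
>       delegate.initialize(new FileSplit(split.getPath(idx),
>           split.getOffset(idx), split.getLength(idx),
>           split.getLocations()), ctx);
>     }
>
>     @Override
>     public boolean nextKeyValue() throws IOException {
>       return delegate.nextKeyValue();
>     }
>
>     @Override
>     public LongWritable getCurrentKey() { return delegate.getCurrentKey(); }
>
>     @Override
>     public Text getCurrentValue() { return delegate.getCurrentValue(); }
>
>     @Override
>     public float getProgress() throws IOException {
>       return delegate.getProgress();
>     }
>
>     @Override
>     public void close() throws IOException { delegate.close(); }
>   }
> }
>
> With this in place, one map task reads several whole small files, and
> the records still arrive file by file, chunk by chunk -- not
> interleaved.
>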
>> --- On Wed, 7/20/11, Harsh J <ha...@cloudera.com> wrote:
>>
>>> From: Harsh J <ha...@cloudera.com>
>>> Subject: Re: Order of files in Map class
>>> To: hdfs-user@hadoop.apache.org
>>> Date: Wednesday, July 20, 2011, 3:44 AM
>>> Florin,
>>>
>>> Your second example is how it happens in Hadoop, but there's more
>>> here to understand.
>>>
>>> To start with, your InputFormat (input splitter) computes and
>>> publishes a set of InputSplits. The total number of input splits
>>> will be your total number of 'Map Tasks' in Hadoop as the job
>>> proceeds. The input splits are generally block splits, i.e.,
>>> start-and-stop lengths over the same file.
>>>
>>> Each 'MapTask' is assigned one split from this list. So every map
>>> task initializes separately, in its own JVM (no shared resources --
>>> again, it's a different mapper instance per file or block!) and
>>> reads its input split alone, feeding it into its map(key, value,
>>> context) function.
>>>
>>> So to summarize, your second example is what will happen, but in
>>> parallel instead, such as:
>>>
>>> map1  | map2  | …
>>> file1 | file2 | …
>>> row1  | row1  | …
>>> row2  | row2  | …
>>>
>>> P.S. What I've explained here is the default behavior. Of course,
>>> things can be tweaked heavily to achieve other things, like your
>>> first example, but those probably come with greater read costs
>>> attached. The 'hadoop' way is data-local and one-file-per-task.
>>>
>>> On Wed, Jul 20, 2011 at 12:11 PM, Florin P <florinp...@yahoo.com>
>>> wrote:
>>> > Hello!
>>> > Suppose that we have the files F1, F2, .. Fk given by the input
>>> > splitter to the map class. In what order will they arrive when
>>> > the map function is applied?
>>> > What interests me is whether it is possible for mixed key-value
>>> > pairs from different files to arrive at the map function. Will
>>> > the keys arrive grouped by their file, until no more keys are
>>> > left from the source file, or can they arrive as one key from F1,
>>> > one key from Fk, and so on?
>>> > Example:
>>> > Mixed key-value pairs at the map function:
>>> > K1 from F1
>>> > K5 from F5
>>> > K7 from F8
>>> > etc.
>>> >
>>> > Ordered key-value pairs:
>>> > K1 from F1
>>> > ..
>>> > K_end_F1 from F1
>>> > K5 from F5
>>> > ..
>>> > K_end_F5 from F5
>>> > and so on.
>>> >
>>> > I look forward to your answer.
>>> > Regards,
>>> > Florin
>>>
>>> --
>>> Harsh J
>>
>
> --
> Harsh J

--
Harsh J