Comments inline
Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz

On Fri, Sep 4, 2009 at 5:46 AM, Xine Jar <[email protected]> wrote:

> Hello,
> I have a MapReduce application reading from an existing HBase table. The
> map function searches for some values in the table and the reduce
> function averages them.

Tell us a bit more about the job. This doesn't give much clarity.

> My question is simple:
>
> **** Method 1 ****
> I initially wrote the program passing the map function "input key type:
> ImmutableBytesWritable, input value: RowResult". Of course I also set
> setInputFormat(TableInputFormat.class) and the COLUMN_LIST.
>
> I added a debug user counter in order to check how often my table gets
> read, and discovered (with your help as well) that the table is read N
> times, where N is the number of rows in the table, which was of course
> not acceptable. This was due to the fact that I was passing the
> RowResult as an input to the map function.
>
> **** Method 2 ****
> I decided not to pass the RowResult as an input to the map. Instead I
> passed a Text which I do not actually use in the map function at all; I
> used it only to pass something so that Hadoop does not give me an
> error :). Then, similarly to the first method, I created a scanner on
> the HBase table inside the map function and started reading the rows.
>
> With this solution, once I stopped passing the RowResult as a parameter
> to the mapper, the job was much faster and the table was read only once!
> Perfect!

In this method, how do different mappers get different inputs or different
values from the table?

> Question
>
> - Are there any hidden performance issues or complications behind my
> Method 2?
>
> - It is true that I reached a solution with what I have done, but I am
> wondering if I can do it in a cleaner way.
> So I was wondering if I could somehow skip passing an input key and
> input value to the map? If yes, how?

You can use a filter in the scanner that's reading the table and giving the
input to the mapper in Method 1. That will skip all the irrelevant records
from being read and therefore speed up the job.

> Regards,
> CJ

Tell us a bit more so we can give inputs.
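To make the filter suggestion concrete: with the newer (0.20) HBase client API it might be configured roughly as below. This is a sketch only, not the original poster's code — the table name, column family/qualifier, value encoding, and the `AverageMapper` class are all made-up assumptions:

```java
// Sketch only: assumes the HBase 0.20 Scan/Filter API rather than the
// older RowResult-based API the original job uses, and assumes the
// readings are stored as longs. "mytable" and "cf:reading" are made up.
Scan scan = new Scan();
scan.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("reading"));

// Server-side filter: rows whose cf:reading is below the bound are
// skipped before they ever reach a mapper.
scan.setFilter(new SingleColumnValueFilter(
        Bytes.toBytes("cf"), Bytes.toBytes("reading"),
        CompareFilter.CompareOp.GREATER_OR_EQUAL,
        Bytes.toBytes(100L)));

TableMapReduceUtil.initTableMapperJob(
        "mytable", scan, AverageMapper.class,  // AverageMapper is hypothetical
        Text.class, DoubleWritable.class, job);
```

Because the filter runs on the region servers, the irrelevant rows are never shipped to the mappers at all, which is where the speedup comes from.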
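Logically, what the filtered Method 1 job computes is just "average the values that survive the filter". A minimal, HBase-free Java sketch of that logic (all names and the threshold are illustrative, not from the original job):

```java
import java.util.List;

public class FilterThenAverage {
    // Each "row" is reduced to one column value for this sketch.
    static double averageMatching(List<Double> rows, double threshold) {
        double sum = 0;
        int count = 0;
        for (double v : rows) {
            if (v >= threshold) { // the "filter": skip irrelevant records
                sum += v;         // the "reduce": running sum and count
                count++;
            }
        }
        return count == 0 ? Double.NaN : sum / count;
    }

    public static void main(String[] args) {
        List<Double> rows = List.of(1.0, 5.0, 9.0, 3.0);
        System.out.println(averageMatching(rows, 4.0)); // prints 7.0
    }
}
```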
