Thank you for the clarifications.
Basically, the table contains 1,000,000 rows. Some of them contain
temperature readings, others moisture, and others humidity.
*The map*:
Reads the table line by line, and once it finds a temperature reading it
calls output("temperature", value).
*The reducer*:
Gets the list of values, sums them, and averages them.
Does this need another job?
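
The map and reduce steps described above can be sketched in plain Python. This is only an illustration, not the actual Hadoop/HBase job: the row layout, the names `ROWS`, `map_row`, `reduce_values` and `run_job`, and the hash-based assignment of keys to reducers are all assumptions made for the example. It also shows that, because every map output carries the same key "temperature", all values land in a single reduce call no matter how many reduce tasks are configured.

```python
from collections import defaultdict

# Hypothetical rows: (row_key, reading_type, value).
# The real table layout is not shown in this thread.
ROWS = [
    ("r1", "temperature", 20.0),
    ("r2", "humidity", 55.0),
    ("r3", "temperature", 22.0),
    ("r4", "moisture", 0.3),
    ("r5", "temperature", 24.0),
]


def map_row(row):
    """Emit ("temperature", value) for temperature rows, nothing otherwise."""
    _row_key, reading_type, value = row
    if reading_type == "temperature":
        yield ("temperature", value)


def reduce_values(key, values):
    """Sum the values for one key and return (key, average)."""
    values = list(values)
    return key, sum(values) / len(values)


def run_job(rows, num_reducers):
    """Simulate map, shuffle and reduce; return (results, busy_reducers)."""
    # Shuffle: every key is assigned to exactly one reducer partition.
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for row in rows:
        for key, value in map_row(row):
            partitions[hash(key) % num_reducers][key].append(value)

    results = {}
    busy_reducers = 0
    for partition in partitions:
        if partition:
            busy_reducers += 1  # this reducer received at least one key
        for key, values in partition.items():
            out_key, average = reduce_values(key, values)
            results[out_key] = average
    return results, busy_reducers


if __name__ == "__main__":
    for n in (2, 4):
        print(n, run_job(ROWS, n))
```

In this sketch the average is the same for 2 and 4 reducers, and only one reducer ever has work to do, since there is only one key.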
R1- I have set two map tasks and two reduce tasks. I even changed the
number of reduce tasks to 4, but the performance was more or less the same. Why?
R2- A fast enough reduce phase? Hmmm... How can I know if it is performing
worse than it should? I have read that HBase is used at Google for
Google Earth! Is it possible to know the expected delay of such queries,
at least to an order of magnitude?
R2'- My table is now 400 MB. Suppose I would like to have a table of 4 TB.
What kind of behaviour should I expect for the query delay?
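
As a rough aid for the 400 MB versus 4 TB comparison, the arithmetic below estimates the minimum number of regions each table size implies, assuming the 256 MB default for hbase.hregion.max.filesize mentioned in the quoted message. The function name `min_regions` is made up for this sketch. It gives a lower bound only (the 400 MB table already shows 3 regions rather than 2) and it says nothing about query delay by itself.

```python
import math

# Default hbase.hregion.max.filesize as quoted in this thread (assumed unchanged).
MAX_REGION_MB = 256


def min_regions(table_mb, max_region_mb=MAX_REGION_MB):
    """Lower bound on the region count if every region were completely full."""
    return math.ceil(table_mb / max_region_mb)


if __name__ == "__main__":
    print(min_regions(400))               # current table, 400 MB
    print(min_regions(4 * 1024 * 1024))   # hypothetical table, 4 TB in MB
```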
Thank you,
CJ
On Thu, Sep 3, 2009 at 5:35 PM, Jean-Daniel Cryans <[email protected]> wrote:
> See answers.
>
> J-D
>
> On Thu, Sep 3, 2009 at 11:07 AM, Xine Jar <[email protected]> wrote:
> > Hello,
> > I have a cluster of 6 nodes (Namenode, Jobtracker, an hbase master, and
> > three regionservers) running hadoop-0.19.1 and hbase-0.19.3. I have
> > created an hbase table "mytable" and have written a program to read the
> > value in each line of the table and get the overall average of the values.
> >
> > I have a few quick clarification questions to pose.
> >
> > Q1- "MyTable" has one column family and a size of 400MB. According to
> > the default value of hbase.hregion.max.filesize, I
> > EXPECTED that it would be split into two regions, 256MB and
> > 144MB. But the UI on port 60010 showed that
> > "mytable" has 3 regions (107MB+89MB+223MB). Why 3 and not 2?
>
> HBase doesn't split exactly when a region reaches 256MB, nor exactly
> in half at 128MB. You could easily have a daughter region getting 180MB
> and then getting another split, giving you an 89MB region, and another
> one at 223MB.
>
> >
> > Q2- The UI of the hbase master on port 60010 showed the three regions of
> > "mytable", each with a start key and an end key. I
> > noticed as well that the three regions are stored on the same
> > regionserver. The other regionservers stored the ROOT and
> > the META. Shouldn't the regions of "mytable" be equally distributed
> > and stored on all region servers?
>
> Nope, if we did that we would be constantly moving the regions around
> at the slightest change. We included some "sloppiness" so when you
> have a very small region count it's obvious but when you get around
> the real numbers (more than 100) you will see it rather well
> distributed. It's not perfect tho, we can still improve a lot of
> stuff.
>
> >
> > Q3- The job is taking around 1 minute to finish. I can see that the reduce
> > function is very slow. Could you give me some hints
> > on how I can make it faster? In which case should I think about
> > splitting the Job into 2? Is there something else I should try to
> > enhance the performance?
>
> What is your reduce even doing? How many reduce tasks do you set? What
> would be in your opinion a fast enough reduce phase?
>
> >
> > Regards,
> > CJ
> >
>