Re: Should the regions of a table be more or less equally distributed and stored?

Xine Jar Thu, 03 Sep 2009 09:30:21 -0700

Thank you for the clarifications.

Well, basically the table contains 1,000,000 rows. Some of them contain
temperature readings, others moisture, and others humidity.


*The map*:
Reads the table line by line and, once it finds a temperature reading, it does
output("temperature", value).

*The reducer*:
Gets the list of values, sums them, and averages them.
Does this need another job?
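
To make the map and reduce steps above concrete, here is a minimal sketch in plain Python with no Hadoop or HBase involved. The sample rows, the function names, and the (sum, count) pair emitted by the map are all assumptions made for illustration. It suggests that the sum and the average can both come out of a single reduce step, so a second job would not seem necessary for this calculation alone.

```python
from collections import defaultdict

# Hypothetical sample rows: (row_key, reading_type, value). Not real data.
ROWS = [
    ("r1", "temperature", 20.0),
    ("r2", "moisture", 0.4),
    ("r3", "temperature", 22.0),
    ("r4", "humidity", 55.0),
    ("r5", "temperature", 24.0),
]


def map_row(row):
    """Emit ("temperature", (value, 1)) for temperature rows, nothing otherwise."""
    _row_key, reading_type, value = row
    if reading_type == "temperature":
        yield "temperature", (value, 1)


def reduce_pairs(pairs):
    """Add up (sum, count) pairs; the same logic could serve as a combiner."""
    total, count = 0.0, 0
    for partial_sum, partial_count in pairs:
        total += partial_sum
        count += partial_count
    return total, count


def run_job(rows):
    """Simulate one map/shuffle/reduce pass in memory."""
    # Shuffle: group the map output by key.
    grouped = defaultdict(list)
    for row in rows:
        for key, pair in map_row(row):
            grouped[key].append(pair)
    # Reduce: one average per key, computed from the summed pairs.
    averages = {}
    for key, pairs in grouped.items():
        total, count = reduce_pairs(pairs)
        averages[key] = total / count
    return averages


print(run_job(ROWS))  # {'temperature': 22.0}
```

Emitting (value, 1) rather than the bare value keeps the partial results mergeable, which is what would let the same summing logic run as a combiner on the map side.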

R1- I have set two map tasks and two reduce tasks. I even changed the number
of reduce tasks to 4, but the performance was more or less the same. Why?
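
One possible explanation, sketched under assumptions: Hadoop's default partitioner picks a reducer by hashing the key, and the map described above only ever emits the key "temperature". The snippet below is a plain-Python stand-in, not Hadoop code; zlib.crc32 and the function names are choices made for this sketch. It counts how many reducers would receive any input when every record shares one key.

```python
import zlib


def partition(key, num_reducers):
    """Pick a reducer index by hashing the key.

    zlib.crc32 is a stand-in hash chosen for this sketch, not Hadoop's own.
    """
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def reducers_with_work(keys, num_reducers):
    """Return the set of reducer indexes that receive at least one record."""
    return {partition(key, num_reducers) for key in keys}


# Every map output record carries the same key.
map_output_keys = ["temperature"] * 1000

for n in (2, 4):
    busy = reducers_with_work(map_output_keys, n)
    print(n, "reducers configured,", len(busy), "with input")
```

If that is what is happening here, the extra reduce tasks would start up and finish with nothing to do, which would fit the unchanged run time.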

R2- A fast enough reduce phase? Hmm... How can I know whether it is performing
worse than it should? I have read that Bigtable, which HBase is modeled on,
is used at Google for Google Earth! Is it possible to know the expected delay
of your queries, at least to an order of magnitude?

R2'- My table is now 400 MB. Assume I would like to have a table of 4 TB.
What kind of behaviour should I expect for the query delay?
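
As a rough sense of scale only, the arithmetic below estimates how many regions each table size implies, assuming no region grows past the 256MB hbase.hregion.max.filesize default mentioned in Q1. The function name is made up for this sketch, and the result says nothing about query delay itself; it only indicates how many regions a full scan would have to touch.

```python
import math

MB = 1024 ** 2
TB = 1024 ** 4


def min_region_count(table_bytes, max_region_bytes=256 * MB):
    """Fewest regions that can hold the table if none exceeds the max size."""
    return math.ceil(table_bytes / max_region_bytes)


print(min_region_count(400 * MB))  # 2
print(min_region_count(4 * TB))    # 16384
```

Real counts would likely be higher, since regions split unevenly, as the 3 regions observed for the 400 MB table show.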

Thank you,
CJ


On Thu, Sep 3, 2009 at 5:35 PM, Jean-Daniel Cryans <[email protected]> wrote:

> See answers.
>
> J-D
>
> On Thu, Sep 3, 2009 at 11:07 AM, Xine Jar <[email protected]> wrote:
> > Hello,
> > I have a cluster of 6 nodes (Namenode, Jobtracker, an hbase master, and
> > three regionservers) running hadoop-0.19.1 and hbase-0.19.3. I have
> > created an hbase table "mytable" and have written a program to read the
> > value in each line of the table and get the overall average of the values.
> >
> > I have a few quick clarification questions to pose.
> >
> > Q1- "mytable" has one column family and a size of 400MB. According to
> > the default value of hbase.hregion.max.filesize, I EXPECTED it to be
> > split into two regions of 256MB and 144MB. But the UI on port 60010
> > showed that "mytable" has 3 regions (107MB+89MB+223MB). Why 3, not 2?
>
> HBase doesn't split a region exactly when it reaches 256MB, nor exactly
> into two 128MB halves. You could easily have a daughter of a split getting
> 180MB and then getting split again, giving you an 89MB region and another
> one at 223MB.
>
> >
> > Q2- The UI of the hbase master on port 60010 showed the three regions of
> > "mytable", each with a start key and an end key. I noticed as well that
> > the three regions are stored on the same regionserver. The other
> > regionservers stored the ROOT and the META. Shouldn't the regions of
> > "mytable" be equally distributed and stored on all regionservers?
>
> Nope. If we did that, we would be constantly moving regions around at the
> slightest change. We included some "sloppiness", so the imbalance is
> obvious when you have a very small region count, but once you get to
> realistic numbers (more than 100 regions) you will see them rather well
> distributed. It's not perfect though; we can still improve a lot of
> stuff.
>
> >
> > Q3- The job is taking around 1 minute to finish, and I can see that the
> > reduce function is very slow. Could you give me some hints on how I can
> > make it faster? In which case should I think about splitting the job
> > into 2? Is there something else I should try to enhance the performance?
>
> What is your reduce even doing? How many reduce tasks did you set? What
> would be, in your opinion, a fast enough reduce phase?
>
> >
> > Regards,
> > CJ
> >
>
