Just FYI, 0.20 handles small cell values substantially better than 0.19.1. -ryan
On Wed, Apr 8, 2009 at 12:35 AM, Amandeep Khurana <[email protected]> wrote:

> Hadoop and HBase are intelligent enough to balance the load. It's not very
> frequent that you need to balance the load manually. Your cluster isn't
> performing because of the low memory and the low limits on top of it. I
> don't think the load is a problem at all.
>
> Hadoop and HBase are not designed for small data sizes and therefore don't
> have the best performance when you have small files or small tables. The
> most difficult part of HBase is starting up and growing the table to a
> certain threshold level. You'll encounter trouble in that phase (which you
> already are). After that, it's a breeze...
>
>
> Amandeep Khurana
> Computer Science Graduate Student
> University of California, Santa Cruz
>
>
> On Wed, Apr 8, 2009 at 12:29 AM, Rakhi Khatwani <[email protected]> wrote:
>
> > Thanks, Amandeep
> >
> > One more question: I mailed it earlier and attached the snapshot along
> > with that email. I have noticed that all my requests are handled by one
> > region server... Is there any way to balance the load? And will
> > balancing the load improve the performance?
> >
> > PS: I have tried using Hadoop load balancing, but after some time some
> > of my region servers shut down... I have even gone through the archives
> > and someone did report an unstable cluster due to load balancing, so I
> > really don't know if I should turn load balancing on.
> >
> > Thanks,
> > Raakhi
> >
> > On Wed, Apr 8, 2009 at 12:51 PM, Amandeep Khurana <[email protected]> wrote:
> >
> > > I'm not sure if I can answer that correctly or not, but my guess is
> > > no, it won't hamper the performance.
> > >
> > >
> > > Amandeep Khurana
> > > Computer Science Graduate Student
> > > University of California, Santa Cruz
> > >
> > >
> > > On Wed, Apr 8, 2009 at 12:13 AM, Rakhi Khatwani <[email protected]> wrote:
> > >
> > > > Hi Amandeep,
> > > >
> > > > But in that case, if I let HBase split it automatically, my table
> > > > with 17000 rows will have only one region. Thus my analysis will
> > > > have only one map. Won't the analysis process be slower in that
> > > > case?
> > > >
> > > > Thanks,
> > > > Raakhi
> > > >
> > > > On Wed, Apr 8, 2009 at 12:35 PM, Amandeep Khurana <[email protected]> wrote:
> > > >
> > > > > You can't compensate for the RAM with processing power. HBase
> > > > > keeps a lot of open file handles in HDFS, which needs memory, so
> > > > > you need the RAM.
> > > > >
> > > > > Secondly, 17000 rows isn't much to cause a region split. I don't
> > > > > know exact numbers, but I had a table with 6 million rows and
> > > > > only 3 regions. So that's not a big deal.
> > > > >
> > > > > Thirdly, try upping the xceivers and ulimit and see if it works
> > > > > with the existing RAM... That's the only way out.
> > > > >
> > > > >
> > > > > Amandeep Khurana
> > > > > Computer Science Graduate Student
> > > > > University of California, Santa Cruz
> > > > >
> > > > >
> > > > > On Wed, Apr 8, 2009 at 12:02 AM, Rakhi Khatwani <[email protected]> wrote:
> > > > >
> > > > > > Hi Amandeep,
> > > > > >
> > > > > > Following is my EC2 cluster configuration:
> > > > > > High-CPU Medium Instance: 1.7 GB of memory, 5 EC2 Compute Units
> > > > > > (2 virtual cores with 2.5 EC2 Compute Units each), 350 GB of
> > > > > > instance storage, 32-bit platform
> > > > > >
> > > > > > So I don't think I have much option when it comes to the GB
> > > > > > part. However, is there any way I can make use of the 5 EC2
> > > > > > compute units to increase my performance?
> > > > > >
> > > > > > Regarding the table splits, I don't see HBase doing the table
> > > > > > splits automatically. After loading about 17000 rows in table1,
> > > > > > I can still see it as one region (after checking it on the web
> > > > > > UI). That's why I had to manually split it. Or is there any
> > > > > > configuration/setting I have to do to ensure that the tables
> > > > > > are split automatically?
> > > > > >
> > > > > > I will increase the dataXceivers and ulimit to 32k.
> > > > > >
> > > > > > Thanks a ton
> > > > > > Rakhi.
> > > > > >
> > > > > > > > Hi Amandeep,
> > > > > > > > I have 1GB memory on each node on the EC2 cluster (C1
> > > > > > > > Medium). I am using hadoop-0.19.0 and hbase-0.19.0.
> > > > > > > > Well, we were starting with 10,000 rows, but later it will
> > > > > > > > go up to 100,000 rows.
> > > > > > >
> > > > > > > 1GB is too low. You need around 4GB to get a stable system.
> > > > > > >
> > > > > > > > My map task basically reads an HBase table 'Table1',
> > > > > > > > performs analysis on each row, and dumps the analysis
> > > > > > > > results into another HBase table 'Table2'. Each analysis
> > > > > > > > task takes about 3-4 minutes when tested on a local machine
> > > > > > > > (the algorithm part... without the map reduce).
> > > > > > > >
> > > > > > > > I have divided 'Table1' into 30 regions before sending it
> > > > > > > > to the map, and set the maximum number of map tasks to 20.
> > > > > > >
> > > > > > > Let HBase do the division into regions. Leave the table as it
> > > > > > > is in the default state.
> > > > > > >
> > > > > > > > I have set DataXceivers to 1024 and ulimit to 1024.
> > > > > > >
> > > > > > > Yes... increase these: 2048 dataxceivers and 32k ulimit.
> > > > > > >
> > > > > > > > I am able to process about 300 rows in an hour, which I
> > > > > > > > feel is quite slow... How do I increase the performance?
> > > > > > >
> > > > > > > The reasons are mentioned above.
> > > > > > >
> > > > > > > > Meanwhile I will try setting the dataXceivers to 2048 and
> > > > > > > > increasing the file limit as you mentioned.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Rakhi
> > > > > > > >
> > > > > > > > On Wed, Apr 8, 2009 at 11:40 AM, Amandeep Khurana <[email protected]> wrote:
> > > > > > > >
> > > > > > > > > 20 nodes is good enough to begin with. How much memory do
> > > > > > > > > you have on each node? IMO, you should keep 1GB per
> > > > > > > > > daemon and 1GB for the MR job like Andrew suggested.
> > > > > > > > > You don't necessarily have to separate the datanodes and
> > > > > > > > > tasktrackers as long as you have enough resources.
> > > > > > > > > 10000 rows isn't big at all from an HBase standpoint.
> > > > > > > > > What kind of computation are you doing before dumping
> > > > > > > > > data into HBase? And what versions of Hadoop and HBase
> > > > > > > > > are you running?
> > > > > > > > >
> > > > > > > > > There's another thing you should do. Increase the
> > > > > > > > > DataXceivers limit to 2048 (that's what I use).
> > > > > > > > >
> > > > > > > > > If you have root privilege over the cluster, then
> > > > > > > > > increase the file limit to 32k (see the HBase FAQ for
> > > > > > > > > details).
> > > > > > > > >
> > > > > > > > > Try this out and see how it goes.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Amandeep Khurana
> > > > > > > > > Computer Science Graduate Student
> > > > > > > > > University of California, Santa Cruz
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Tue, Apr 7, 2009 at 2:45 AM, Rakhi Khatwani <[email protected]> wrote:
> > > > > > > > >
> > > > > > > > > > Hi,
> > > > > > > > > > I have a 20 node cluster on EC2 (small instance)... I
> > > > > > > > > > have a set of tables which store a huge amount of data
> > > > > > > > > > (tried with 10,000 rows... more to be added)... but
> > > > > > > > > > during my map reduce jobs, some of the region servers
> > > > > > > > > > shut down, thereby causing data loss, a stop in my
> > > > > > > > > > program execution, and in fact one of my tables got
> > > > > > > > > > damaged. Whenever I scan the table, I get the "could
> > > > > > > > > > not obtain block" error.
> > > > > > > > > >
> > > > > > > > > > 1. I want to make the cluster more robust, since it
> > > > > > > > > > contains a lot of data and it's really important that
> > > > > > > > > > it remains stable.
> > > > > > > > > > 2. If one of my tables gets damaged (even after
> > > > > > > > > > restarting DFS and HBase), how do I go about recovering
> > > > > > > > > > it?
> > > > > > > > > >
> > > > > > > > > > My EC2 cluster mostly has the default configuration,
> > > > > > > > > > with hadoop-site and hbase-site having some entries
> > > > > > > > > > pertaining to map-reduce (for example, number of map
> > > > > > > > > > tasks, mapred.task.timeout, etc.).
> > > > > > > > > >
> > > > > > > > > > Your help will be greatly appreciated.
> > > > > > > > > > Thanks,
> > > > > > > > > > Raakhi Khatwani
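
The DataXceiver and file-handle advice repeated in the thread usually comes
down to two changes. A minimal sketch follows, assuming a hadoop-0.19-era
hadoop-site.xml and daemons running under a "hadoop" account (the account
name and file locations are assumptions; the HBase FAQ has the exact steps):

  <!-- hadoop-site.xml on each datanode: raise the xceiver count.
       Note the property name really is spelled "xcievers". -->
  <property>
    <name>dfs.datanode.max.xcievers</name>
    <value>2048</value>
  </property>

  # /etc/security/limits.conf: raise the open-file limit to 32k for the
  # (assumed) "hadoop" user that runs the datanodes and regionservers
  hadoop  -  nofile  32768

The datanodes need a restart for the xceiver change, and the limits.conf
change only applies to new logins, so the daemons have to be restarted from
a fresh shell for the higher ulimit to take effect.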

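For the Table1-to-Table2 analysis job Rakhi describes, a rough sketch
against the hbase-0.19 org.apache.hadoop.hbase.mapred API is below. The
class name, the "data:" and "result:" column families, and the doAnalysis()
helper are placeholders rather than anything from the thread, and the exact
signatures should be checked against the release in use:

  // Sketch only: scan Table1, analyse each row, write results to Table2.
  import java.io.IOException;

  import org.apache.hadoop.hbase.io.BatchUpdate;
  import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
  import org.apache.hadoop.hbase.io.RowResult;
  import org.apache.hadoop.hbase.mapred.IdentityTableReduce;
  import org.apache.hadoop.hbase.mapred.TableMap;
  import org.apache.hadoop.hbase.mapred.TableMapReduceUtil;
  import org.apache.hadoop.mapred.JobClient;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reporter;

  public class AnalysisJob {

    public static class AnalysisMap extends MapReduceBase
        implements TableMap<ImmutableBytesWritable, BatchUpdate> {

      public void map(ImmutableBytesWritable row, RowResult value,
          OutputCollector<ImmutableBytesWritable, BatchUpdate> output,
          Reporter reporter) throws IOException {
        // Hypothetical per-row analysis; the real 3-4 minute algorithm
        // from the thread would go here.
        byte[] result = doAnalysis(value);
        BatchUpdate update = new BatchUpdate(row.get());
        update.put("result:value", result); // "result:" family is assumed
        output.collect(row, update);
      }

      private byte[] doAnalysis(RowResult value) {
        return value.getRow(); // placeholder for the actual analysis
      }
    }

    public static void main(String[] args) throws IOException {
      JobConf job = new JobConf(AnalysisJob.class);
      job.setJobName("table1-analysis");
      // Map over the "data:" family of Table1; write updates into Table2.
      TableMapReduceUtil.initTableMapJob("Table1", "data:", AnalysisMap.class,
          ImmutableBytesWritable.class, BatchUpdate.class, job);
      TableMapReduceUtil.initTableReduceJob("Table2",
          IdentityTableReduce.class, job);
      JobClient.runJob(job);
    }
  }

Because the table input format produces one split per region, a job like
this runs only one map task while Table1 is a single region, which is why
the thread keeps coming back to memory, xceivers, and region splits rather
than the number of map slots.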