On Sun, May 17, 2009 at 10:53 PM, Yair Even-Zohar <[email protected]> wrote:
> 1) EC2, medium server OK.
>
> 2) 3 or 4 column families. From thousands to millions of columns
>

3 or 4 column families should be fine. Are you doing a full row get, or are
you getting individual columns on each fetch? (The latter is faster.)

Thousands to millions of columns per row will give you trouble in 0.19.x
hbase: https://issues.apache.org/jira/browse/HBASE-867. HBase will run slow.
Hopefully addressed in 0.20.0 hbase.

St.Ack

> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of stack
> Sent: Sunday, May 17, 2009 10:58 PM
> To: [email protected]
> Subject: Re: question about parallel get()
>
> On Sun, May 17, 2009 at 11:19 AM, Yair Even-Zohar <[email protected]> wrote:
>
> > I'd like to run efficient table get() calls and retrieve about
> > 1000 rows, where each row includes about 4 columns (around 20 bytes per
> > cell) with several versions per column. I assume the longest wait is for
> > reading the row from the disk, so I could parallelize these reads. Any
> > suggestions what would be the best method?
>
> 0.19.x hbase or TRUNK?
>
> > 1) How many gets() should I be running in parallel?
>
> Depends on how many disks and distribution of gets over nodes in the
> cluster.
>
> > 2) What's the best number of get() per region?
>
> How many column families? All in one column family?
>
> > 3) Should the row ids be randomized among the different regions?
>
> It's best, yes, to distribute your get load over the cluster if you can.
>
> Sorry for all the 'depends' and answering questions with questions. It's
> my culture (smile).
>
> St.Ack
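
For what it's worth, the parallel-read idea discussed in this thread could be sketched roughly as below. This is only a minimal sketch under stated assumptions, not the HBase client API: `RowReader`, `ParallelGets`, and `fetchAll` are hypothetical names invented for illustration, and `RowReader.read` stands in for whatever single-row or single-column get call your HBase version provides. The pool size is a parameter precisely because, as noted above, the right value depends on how many disks you have and how the gets are distributed over the nodes.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical stand-in for one blocking row read; NOT part of the HBase API. */
interface RowReader {
    String read(String rowKey) throws Exception;
}

/** Hypothetical helper that fans row reads out over a fixed-size thread pool. */
class ParallelGets {

    /**
     * Submits one read task per row key to a pool of `threads` workers and
     * returns the results keyed by row, in the same order as `rowKeys`.
     */
    static Map<String, String> fetchAll(List<String> rowKeys, RowReader reader, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Queue every read up front so the workers can overlap their waits.
            List<Future<String>> futures = new ArrayList<>();
            for (String key : rowKeys) {
                Callable<String> task = () -> reader.read(key);
                futures.add(pool.submit(task));
            }
            // Collect in input order; get() blocks until that read has finished.
            Map<String, String> results = new LinkedHashMap<>();
            for (int i = 0; i < rowKeys.size(); i++) {
                results.put(rowKeys.get(i), futures.get(i).get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for reads", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a row read failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

A call would look something like `ParallelGets.fetchAll(rowKeys, myReader, 8)`, where `myReader` wraps the real table read and `8` is just a starting guess to tune. If the row keys in a batch are spread over many regions, the concurrent reads should land on different region servers rather than queueing on one.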
