Re: question about parallel get()

stack Mon, 18 May 2009 22:32:01 -0700

On Sun, May 17, 2009 at 10:53 PM, Yair Even-Zohar <[email protected]> wrote:


> 1) EC2, medium server


OK.


>
> 2) 3 or 4 column families. From thousands to millions of columns
>


3 or 4 column families should be fine.  Are you doing a full-row get, or are
you getting individual columns on each fetch?  (The latter is faster.)

Thousands to millions of columns per row will give you trouble in 0.19.x
HBase; see https://issues.apache.org/jira/browse/HBASE-867.  HBase will run
slowly.  Hopefully this is addressed in HBase 0.20.0.
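
For the parallel reads asked about in the original question, here is a
minimal sketch of the general pattern: fan the row keys out over a small
thread pool and ask for named columns only rather than whole rows.  It is
not HBase client code.  FAKE_TABLE, fetch_columns and parallel_get are
hypothetical names, the in-memory dict stands in for a real table, and
max_workers=8 is an arbitrary placeholder, since the right degree of
parallelism depends on the number of disks and on how the gets spread
over the nodes.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-in for a table: 1000 rows, 20-byte cells.
# This is NOT an HBase API; a real client call would go here instead.
FAKE_TABLE = {
    "row-%04d" % i: {
        "info:a": b"x" * 20,
        "info:b": b"y" * 20,
        "meta:c": b"z" * 20,
    }
    for i in range(1000)
}


def fetch_columns(row_key, columns):
    """Return only the requested columns of one row ({} if the row is missing)."""
    row = FAKE_TABLE.get(row_key, {})
    return {col: row[col] for col in columns if col in row}


def parallel_get(row_keys, columns, max_workers=8):
    """Fetch many rows concurrently and return {row_key: {column: value}}."""
    row_keys = list(row_keys)  # allow generators; we iterate twice below
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as row_keys.
        results = pool.map(lambda key: fetch_columns(key, columns), row_keys)
        return dict(zip(row_keys, results))


if __name__ == "__main__":
    keys = ["row-%04d" % i for i in range(1000)]
    rows = parallel_get(keys, ["info:a", "meta:c"])
    print(len(rows), sorted(rows["row-0000"]))
```

Threads suit this case because each fetch would mostly wait on network and
disk.  Shuffling the key list before submitting it is one simple way to
keep the concurrent gets from all landing on the same region.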


St.Ack



>
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of
> stack
> Sent: Sunday, May 17, 2009 10:58 PM
> To: [email protected]
> Subject: Re: question about parallel get()
>
> On Sun, May 17, 2009 at 11:19 AM, Yair Even-Zohar <[email protected]> wrote:
>
> > I'd like to run efficient table get() calls and retrieve about 1000
> > rows, where each row includes about 4 columns (around 20 bytes per
> > cell) with several versions per column. I assume the longest wait is
> > for reading the row from disk, so I could parallelize these reads. Any
> > suggestions on what the best method would be?
> >
> >
>
> 0.19.x hbase or TRUNK?
>
>
>
> >
> >
> > 1)       How many gets() should I be running in parallel?
> >
>
>
> Depends on how many disks and distribution of gets over nodes in the
> cluster.
>
>
>
> >
> > 2)       What's the best number of get() per region?
> >
>
>
> How many column families?  All in one column family?
>
>
>
> >
> > 3)       Should the row ids be randomized among the different regions?
> >
> >
> It's best, yes, to distribute your get load over the cluster if you can.
>
> Sorry for all the 'depends' and answering questions with questions.  It's
> my culture (smile).
>
> St.Ack
>
