Hey, HBase should work for your scenario. It will require more than just a few nodes, but should be doable.
So some answers:

- Generally, as long as you don't have hot spots, adding nodes gives linear scaling.
-- For how long? Could be a while; there aren't 1000-node HBase installations yet, but there is nothing in the model that prevents it.

- Access time is a function of how long you have to wait for disk IO, so it is controllable via how many disks you put in a node and how many nodes you have. Generally HBase can really kick out data at the record sizes you describe.

- 64 MB block size: reading a portion of a file doesn't trigger an entire 64 MB block read. Hadoop IO uses a chunk checksum system, and the chunks are 64 KB, so you may need to read 2 * 64 KB if your data spans a checksum boundary. This isn't nearly as big a deal as it seems.

- HBase includes in-line LZO compression, which is really great. You should use it (there's a table-creation sketch at the bottom of this mail).

- Writes... The last question deserves a paragraph answer. You're concerned about writing to the 'end' of a table. In HBase, once a table grows, it is split into multiple regions that are assigned to servers. Each region contains a contiguous run of keys, e.g. [A,B). So to avoid a write hot spot, don't use a sequence ID as the primary key, or anything that resembles one (i.e. timestamps as keys). Instead, if you can use a key that is more random, or at least more spread out, you get parallel writes and can achieve very high read/write speeds (see the key-salting sketch at the bottom of this mail).

Can HBase support 60k read ops/sec? That becomes a function of how much seeking you do, how many spindles you have, and how effective OS block caching is. Adding more spindles, either per node or as more nodes, helps. For purely random reads you are at the mercy of a ~9ms seek, which really limits the number of reads/sec you can do. This isn't an HBase problem; you'll run into it on any system. Solutions are caching, more spindles, and possibly flash (rough numbers at the bottom of this mail).

Good luck!
-ryan

On Wed, Nov 4, 2009 at 3:42 PM, Jason Strutz <[email protected]> wrote:
> I am evaluating replacing a homegrown file storage system with HBase. Here
> are the stats on our current environment:
> - Our workload is going to be all single-record reads and writes
> - We have 50TB of data, with each record being 10kb to 10mb in size
>   (average of 300kb), in a single column
> - Peak 60k reads per hour
> - Peak 20k writes per hour
>
> Here are the questions:
> What sort of hardware should I be looking at here? Will capacity scale
> linearly as we add more servers, and for how long?
>
> Will I be able to get at least a 250ms access time with a reasonable
> cluster size?
>
> From what I understand, we're looking at a theoretical 64mb block read from
> disk for any row. In practice, how significant is this, when taking into
> account caching and other optimizations?
>
> We can do sequential writes, but at the expected loss of an anticipated
> ~50% reduction in store size due to compression. I am also concerned that
> sequential writes at the end of the table will all end up writing to one
> disk - instead of distributing the load across all servers.
>
> Thanks in advance,
> -Jason
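
P.S. On the LZO point: compression is set per column family when you create the table. Here is a rough Java sketch of what I mean -- "records" and "data" are made-up table/family names, the admin classes have moved around a bit between releases, and you need the LZO codec installed on every node (it ships separately for license reasons), so treat this as illustrative rather than copy-paste.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.io.hfile.Compression;

public class CreateLzoTable {
  public static void main(String[] args) throws Exception {
    // Illustrative names: one big "data" family holding the blob column.
    HColumnDescriptor family = new HColumnDescriptor("data");
    // Store files for this family get LZO-compressed on flush/compaction.
    family.setCompressionType(Compression.Algorithm.LZO);

    HTableDescriptor table = new HTableDescriptor("records");
    table.addFamily(family);

    new HBaseAdmin(HBaseConfiguration.create()).createTable(table);
  }
}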
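
And here is what I mean by spreading keys out instead of using a sequence ID or timestamp: prefix the natural ID with a few bytes of its hash, so consecutive writes land in different regions. Again a sketch against the Java client with made-up names; exact client calls differ by version, and the trade-off is that you give up ordered scans over the natural key.

import java.security.MessageDigest;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class SaltedWrite {
  // Prefix the natural key with 4 bytes of its MD5 hash so rows spread
  // across regions instead of piling onto the "last" region.
  static byte[] saltedKey(String naturalId) throws Exception {
    byte[] hash = MessageDigest.getInstance("MD5").digest(Bytes.toBytes(naturalId));
    return Bytes.add(Bytes.head(hash, 4), Bytes.toBytes(naturalId));
  }

  public static void main(String[] args) throws Exception {
    // "records" / "data" / "blob" are placeholders for your own schema.
    HTable table = new HTable(HBaseConfiguration.create(), "records");
    Put put = new Put(saltedKey("record-12345"));
    put.add(Bytes.toBytes("data"), Bytes.toBytes("blob"), Bytes.toBytes("...payload..."));
    table.put(put);
    table.close();
  }
}

Reads still work fine because you can recompute the same salted key from the natural ID at read time.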
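
Finally, the back-of-envelope math behind the seek comment, taking the 60k ops/sec case (the peak in your mail is 60k per hour, which is easy; per second is the hard case) and my assumption of a ~9ms random seek on a 7200rpm disk:

public class SeekMath {
  public static void main(String[] args) {
    double seekMs = 9.0;                              // assumed random seek time
    double readsPerSecPerSpindle = 1000.0 / seekMs;   // ~111 random reads/sec/disk
    double targetReadsPerSec = 60000;                 // the 60k ops/sec case
    for (double cacheHitRate : new double[] {0.0, 0.5, 0.9, 0.99}) {
      double diskReads = targetReadsPerSec * (1 - cacheHitRate);
      System.out.printf("cache hit %.0f%% -> ~%.0f spindles%n",
          cacheHitRate * 100, Math.ceil(diskReads / readsPerSecPerSpindle));
    }
  }
}

That is roughly 540 spindles with no caching, dropping to ~54 at a 90% cache hit rate, which is why caching and flash matter so much for pure random reads.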
