Hey, HBase should work for your scenario. It will require more than just a few nodes, but should be doable.
So some answers:

- Generally, as long as you don't have hot spots, adding nodes gives linear scaling.
-- For how long? Could be a while; there aren't 1000-node HBase installations yet, but there is nothing in the model that prevents it.

- Access time is a function of how long you have to wait for disk IO, so it is controllable via how many disks you put in a node and how many nodes you have. Generally HBase can really kick out data at the record sizes you describe.

- 64 MB block size: reading a portion of a file doesn't trigger an entire 64 MB block read. Hadoop IO uses a chunk checksum system, and the chunks are 64 KB, so you may need to read 2 * 64 KB if your data spans a checksum boundary. This isn't nearly as big a deal as it seems.

- HBase includes in-line LZO compression, which is really great. You should use it (there's a table-creation sketch at the bottom of this mail).

- Writes... The last question deserves a paragraph answer. You're concerned about writing to the 'end' of a table. In HBase, once a table grows, it is split into multiple regions that are assigned to servers. Each region contains a contiguous run of keys, e.g. [A,B). So to avoid a write hot spot, don't use a sequence ID as the primary key, or anything that resembles one (i.e. timestamps as keys). Instead, if you can use a key that is more random, or at least more spread out, you get parallel writes and can achieve very high read/write speeds (see the key-salting sketch at the bottom of this mail).

Can HBase support 60k read ops/sec? That becomes a function of how much seeking you do, how many spindles you have, and how effective OS block caching is. Adding more spindles, either per node or as more nodes, helps. For purely random reads you are at the mercy of a ~9ms seek, which really limits the number of reads/sec you can do. This isn't an HBase problem; you'll run into it on any system. Solutions are caching, more spindles, and possibly flash (rough numbers at the bottom of this mail).

Good luck!
-ryan

On Wed, Nov 4, 2009 at 3:42 PM, Jason Strutz <[email protected]> wrote:
> I am evaluating replacing a homegrown file storage system with HBase. Here
> are the stats on our current environment:
> - Our workload is going to be all single-record reads and writes
> - We have 50TB of data, with each record being 10kb to 10mb in size
>   (average of 300kb), in a single column
> - Peak 60k reads per hour
> - Peak 20k writes per hour
>
> Here are the questions:
> What sort of hardware should I be looking at here? Will capacity scale
> linearly as we add more servers, and for how long?
>
> Will I be able to get at least a 250ms access time with a reasonable
> cluster size?
>
> From what I understand, we're looking at a theoretical 64mb block read from
> disk for any row. In practice, how significant is this, when taking into
> account caching and other optimizations?
>
> We can do sequential writes, but at the expected loss of an anticipated
> ~50% reduction in store size due to compression. I am also concerned that
> sequential writes at the end of the table will all end up writing to one
> disk - instead of distributing the load across all servers.
>
> Thanks in advance,
> -Jason
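
P.S. On the LZO point: compression is set per column family when you create the table. Here is a rough Java sketch of what I mean -- "records" and "data" are made-up table/family names, the admin classes have moved around a bit between releases, and you need the LZO codec installed on every node (it ships separately for license reasons), so treat this as illustrative rather than copy-paste.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.io.hfile.Compression;

public class CreateLzoTable {
  public static void main(String[] args) throws Exception {
    // Illustrative names: one big "data" family holding the blob column.
    HColumnDescriptor family = new HColumnDescriptor("data");
    // Store files for this family get LZO-compressed on flush/compaction.
    family.setCompressionType(Compression.Algorithm.LZO);

    HTableDescriptor table = new HTableDescriptor("records");
    table.addFamily(family);

    new HBaseAdmin(HBaseConfiguration.create()).createTable(table);
  }
}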
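
And here is what I mean by spreading keys out instead of using a sequence ID or timestamp: prefix the natural ID with a few bytes of its hash, so consecutive writes land in different regions. Again a sketch against the Java client with made-up names; exact client calls differ by version, and the trade-off is that you give up ordered scans over the natural key.

import java.security.MessageDigest;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class SaltedWrite {
  // Prefix the natural key with 4 bytes of its MD5 hash so rows spread
  // across regions instead of piling onto the "last" region.
  static byte[] saltedKey(String naturalId) throws Exception {
    byte[] hash = MessageDigest.getInstance("MD5").digest(Bytes.toBytes(naturalId));
    return Bytes.add(Bytes.head(hash, 4), Bytes.toBytes(naturalId));
  }

  public static void main(String[] args) throws Exception {
    // "records" / "data" / "blob" are placeholders for your own schema.
    HTable table = new HTable(HBaseConfiguration.create(), "records");
    Put put = new Put(saltedKey("record-12345"));
    put.add(Bytes.toBytes("data"), Bytes.toBytes("blob"), Bytes.toBytes("...payload..."));
    table.put(put);
    table.close();
  }
}

Reads still work fine because you can recompute the same salted key from the natural ID at read time.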
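
Finally, the back-of-envelope math behind the seek comment, taking the 60k ops/sec case (the peak in your mail is 60k per hour, which is easy; per second is the hard case) and my assumption of a ~9ms random seek on a 7200rpm disk:

public class SeekMath {
  public static void main(String[] args) {
    double seekMs = 9.0;                              // assumed random seek time
    double readsPerSecPerSpindle = 1000.0 / seekMs;   // ~111 random reads/sec/disk
    double targetReadsPerSec = 60000;                 // the 60k ops/sec case
    for (double cacheHitRate : new double[] {0.0, 0.5, 0.9, 0.99}) {
      double diskReads = targetReadsPerSec * (1 - cacheHitRate);
      System.out.printf("cache hit %.0f%% -> ~%.0f spindles%n",
          cacheHitRate * 100, Math.ceil(diskReads / readsPerSecPerSpindle));
    }
  }
}

That is roughly 540 spindles with no caching, dropping to ~54 at a 90% cache hit rate, which is why caching and flash matter so much for pure random reads.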
