I am evaluating replacing a homegrown file storage system with HBase.
Here are the stats on our current environment:
- Our workload will be entirely single-record reads and writes
- We have 50 TB of data, with each record being 10 KB to 10 MB in
size (average of 300 KB), in a single column
- Peak 60k reads per hour
- Peak 20k writes per hour
Here are the questions:
What sort of hardware should I be looking at here? Will capacity
scale linearly as we add more servers, and for how long?
Will I be able to keep access times at or under 250 ms with a
reasonably sized cluster?
From what I understand, we're looking at a theoretical 64 MB block
read from disk for any row. In practice, how significant is this,
once caching and other optimizations are taken into account?
We could write records sequentially, but we expect that would cost us
the anticipated ~50% reduction in store size from compression. I am
also concerned that sequential writes at the end of the table would
all end up hitting one disk, instead of distributing the load across
all servers.
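For what it's worth, if the hot-spotting concern turns out to be real,
the workaround I've seen discussed for monotonically increasing keys is
to prefix the row key with a hash-derived salt so consecutive writes
land in different regions. A minimal sketch (the bucket count and key
format here are hypothetical, just to illustrate the idea):

```python
import hashlib

NUM_BUCKETS = 16  # hypothetical: roughly one bucket per region server


def salted_key(row_key: str) -> str:
    """Prefix a row key with a stable, hash-derived bucket number so
    monotonically increasing keys spread across NUM_BUCKETS regions.
    Reads must apply the same transformation to find the record again."""
    digest = hashlib.md5(row_key.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % NUM_BUCKETS
    return f"{bucket:02d}-{row_key}"


# Sequential keys no longer sort adjacently once salted.
for key in ("record-000001", "record-000002", "record-000003"):
    print(salted_key(key))
```

The trade-off, as I understand it, is that scans over a key range then
require one scan per bucket, which is fine for our single-record access
pattern but worth noting.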
Thanks in advance,
-Jason