My replies and questions inline.

On Apr 28, 2008, at 2:57 PM, Max Grigoriev wrote:

Hi there,

I'm doing research to find the right solution for our needs.
We need a persistence layer for the groups of a social network.
These groups will hold a large amount of data (~100 GB): user profiles, their
activities, etc.
100GB per group, or 100GB overall? How many groups?

All work with these entities must be done online: a user can ask to be
unsubscribed, or to have other users connected to him.
So we'll work with small pieces of a big dataset online, not with big data
offline the way a log parser does.
We want to be able to search on different table attributes, and of course we
want scalability and failover.
What kind of search on different table attributes do you want to do? There are no general purpose secondary indexes in HBase, so you either have to do a full- or partial-table scan or put the search attribute in the primary key.
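
To make the difference concrete, here is a minimal sketch in Python. It is not the HBase API: SortedTable, row_key, and the "group/user" key layout are hypothetical stand-ins for a table kept sorted by row key. When the attribute is part of the key, a lookup only has to visit the matching key range; when it isn't, every row has to be examined.

```python
import bisect


class SortedTable:
    """Hypothetical in-memory stand-in for a table sorted by row key (not HBase)."""

    def __init__(self):
        self._keys = []   # row keys, kept in sorted order
        self._rows = {}   # row key -> dict of column values

    def put(self, key, columns):
        if key not in self._rows:
            bisect.insort(self._keys, key)
        self._rows[key] = columns

    def prefix_scan(self, prefix):
        """Partial scan: visit only the rows whose key starts with prefix."""
        start = bisect.bisect_left(self._keys, prefix)
        matches = []
        for key in self._keys[start:]:
            if not key.startswith(prefix):
                break  # keys are sorted, so no later key can match
            matches.append((key, self._rows[key]))
        return matches

    def full_scan(self, predicate):
        """Full scan: test every row, because the attribute is not in the key."""
        return [(k, self._rows[k]) for k in self._keys if predicate(self._rows[k])]


def row_key(group_id, user_id):
    # Assumed key layout: group first, so one group's members sort together.
    return f"{group_id}/{user_id}"


table = SortedTable()
table.put(row_key("g1", "alice"), {"city": "Paris"})
table.put(row_key("g1", "bob"), {"city": "Oslo"})
table.put(row_key("g2", "carol"), {"city": "Paris"})

members_of_g1 = table.prefix_scan("g1/")                        # touches only g1 rows
in_paris = table.full_scan(lambda cols: cols["city"] == "Paris")  # touches every row
```

With this layout, "all members of group g1" is cheap, while "all users in Paris" still walks the whole table unless a second table keyed by city is maintained alongside it.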

As for failover, at the moment HBase has good recovery for region servers and no recovery for the master. That's something we're hoping to change in the future.

We need to be able to add and remove nodes in the cluster easily, without stopping the entire system.
You can do this, and it's not that hard.


All of this can be done with Amazon SimpleDB, but we don't want to depend on
an external service. That's why we're looking for a third-party product.

We have such candidates:

   - HBase
   - CouchDb
   - HyperTable
   - Own bicycle

Can you tell me whether HBase will work for such a system?
I think HBase can do what you need, but it'd be nice to have more details about what exactly you're going to do with it.

If we have 2 or 3 data centers and we lose the connection between them, what
behavior will we see from HBase?
Is your intent to run a single HBase instance across several data centers? At the moment, if a regionserver is cut off from the master, it will kill itself. This means that if you have your master at one location and regionservers at another, and you lose connectivity, your regionservers at the other locations will shut themselves down. There are solutions to this we've discussed in the past. However, I wonder if maybe the correct solution is not to partition across data centers. It's not something that we've discussed at great length yet, so there might be an easier way to do it than I'm thinking.
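
As a rough illustration of that failure mode, here is a toy simulation in Python. It is only a model: the class and function names, the tick-based clock, and the timeout value are all assumptions made for the example, not HBase internals. It shows a regionserver in the master's data center surviving a link outage while one in the remote data center shuts itself down.

```python
class RegionServer:
    """Toy model of a regionserver; names and timeout are hypothetical."""

    def __init__(self, name, site, timeout):
        self.name = name
        self.site = site
        self.timeout = timeout    # ticks allowed without contact with the master
        self.last_contact = 0
        self.running = True

    def tick(self, now, master_reachable):
        if not self.running:
            return
        if master_reachable:
            self.last_contact = now
        elif now - self.last_contact > self.timeout:
            self.running = False  # cut off from the master for too long: shut down


def simulate(master_site, servers, link_down_from, ticks):
    """Advance the clock; the inter-site link fails at tick link_down_from."""
    for now in range(1, ticks + 1):
        link_up = now < link_down_from
        for rs in servers:
            same_site = rs.site == master_site
            rs.tick(now, same_site or link_up)
    return {rs.name: rs.running for rs in servers}


servers = [RegionServer("rs-a", "dc1", 3), RegionServer("rs-b", "dc2", 3)]
state = simulate("dc1", servers, link_down_from=5, ticks=10)
# rs-a shares a site with the master and keeps running; rs-b does not.
```

In this model a server that has shut down stays down even if the link comes back, so restoring connectivity would not by itself bring the remote regions back online.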

And when the connection is restored after 1-2 hours, what should we expect
from HBase?
This is where things would get sticky - how do you resolve conflicts in how data is being served, or worse, how it was split into regions? It seems inherently complicated and unpleasant.



Thank you.
