Hi, I am not 100% sure HDFS/HBase is exactly what I want. Let me state my requirements here; please help me evaluate them. I am still new to HDFS.
I built a search engine using Nutch, and now I need to collect related content from other systems. The question is where I should store the *related content* and how to connect it with the Nutch index. For storage, two choices are available: HDFS/HBase or an RDBMS. I'd prefer the first one. At the beginning, the content will only be appended, with no deletes or updates, and it should be retrieved by a key (something like a foreign key in an RDBMS, connecting the content to the index). If you think I am on the right track, could you give me some code snippets showing how to store the content? Big thanks - Scott. On 11/30/06, Scott Green <[EMAIL PROTECTED]> wrote:
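For the storage pattern described above, real code would go through the HBase client against a running cluster; since that isn't self-contained, here is a minimal in-memory Java sketch of the same access pattern: append-only writes keyed by something like the Nutch document URL, with retrieval by that key. The class and method names are illustrative, not a real HBase API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// In-memory stand-in for an HBase-style table keyed by document URL.
// Content is written once under a key and later fetched by that key.
public class RelatedContentStore {
    private final Map<String, byte[]> rows = new LinkedHashMap<>();

    // Append-only: refuse to overwrite an existing key.
    public void put(String key, byte[] content) {
        if (rows.containsKey(key)) {
            throw new IllegalStateException("key already written: " + key);
        }
        rows.put(key, content);
    }

    // Retrieve by key, as an index lookup would; null if absent.
    public byte[] get(String key) {
        return rows.get(key);
    }

    public static void main(String[] args) {
        RelatedContentStore store = new RelatedContentStore();
        store.put("http://example.com/page1", "related text".getBytes());
        System.out.println(new String(store.get("http://example.com/page1")));
    }
}
```

In a real HBase table the key would be the row key and the content a cell value in one column family; the append-only discipline sketched here matches the "append without delete/update" requirement.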
Hi Mike,

Are you still working on HBase? Could you please give us an update on the project status? I have strong interest in your idea :)

- Scott

On 5/16/06, Michael Cafarella <[EMAIL PROTECTED]> wrote:
> Hi everyone,
>
> My previous mail mentioned a bunch of design ideas that were mainly
> lifted from Jeff Dean's BigTable talk. BigTable seems like a useful
> way to do large-scale row storage, and their decisions largely seem
> like the right ones.
>
> However, BigTable still leaves some things on the table. Items to
> improve include a query language and multi-row locking, among
> other things.
>
> Dean said explicitly in his talk that they wanted to avoid multi-row
> locking because it's complicated, error-prone, and maybe not necessary.
> He's right on at least the first two, and maybe the third.
>
> Multi-row locks are useful when you're making a change to
> several rows that should be atomic; you want all of the changes
> or none of them. They are also used in traditional databases
> when you want to perform an expensive read operation (like a
> multiway join) and you want to make sure the results don't
> get modified while you're reading.
>
> Distributed lock acquisition is very hard to do. It's bug-prone
> and often has very weird performance ramifications. It's
> difficult to get working, difficult to tune, difficult to everything.
>
> Here are a few ideas on what to do:
>
> 1) Suck it up and have the client acquire locks on multiple
> HRegionServers simultaneously. All clients would have to
> agree to acquire locks according to some global ordering to
> avoid deadlock. HRegions would not be allowed to migrate
> to a new server while locked.
>
> If this is a rare circumstance, a better approach would be
> to have a dedicated "lock acquirer" through which clients
> make requests. It doesn't help the theoretical problem here,
> but it would make debugging an awful lot easier.
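The global-ordering idea in (1) can be sketched in a few lines: if every client sorts the row keys it needs and acquires their locks in that sorted order, two clients contending for overlapping row sets can never hold each other's locks in a cycle. The lock-table class below is hypothetical; in the design above each HRegionServer would hold the locks for its own rows.

```java
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

// Deadlock-free multi-row locking via a single global acquisition
// order (lexicographic order of the row key).
public class OrderedRowLocks {
    private final Map<String, ReentrantLock> locks = new ConcurrentHashMap<>();

    private ReentrantLock lockFor(String row) {
        return locks.computeIfAbsent(row, r -> new ReentrantLock());
    }

    // Acquire every requested row lock in sorted (global) order.
    public void lockAll(Iterable<String> rows) {
        SortedSet<String> ordered = new TreeSet<>();
        for (String r : rows) ordered.add(r);
        for (String r : ordered) lockFor(r).lock();
    }

    // Release; order does not matter for correctness.
    public void unlockAll(Iterable<String> rows) {
        for (String r : rows) lockFor(r).unlock();
    }
}
```

The essential invariant is only that every client uses the *same* total order; which order is chosen is irrelevant.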
> 2) In the case of long-lasting read operations, we can
> use versioning to guarantee consistency. If each row is
> annotated with an edit timestamp, and we know that there
> is sufficient version history available, the long-lasting job
> can run over a specific version only.
>
> Edits can continue to be made to the database while the
> read-only job is ongoing. The operation is performed over
> the database as of the time the task was submitted.
>
> 3) In the case of multi-row updates, we may be able to
> use different edit semantics to avoid locking. For example,
> suppose we want to add a single column/value pair to
> multiple rows. We want this to happen atomically, so that
> all the rows get the value or none of them do.
>
> If it's just an add, then we don't need to lock the rows at
> all; the add will always succeed, even if other writes
> intervene. Traditionally there's been no distinction among
> kinds of data "updates", so they all require locking. If we
> can get the client to adjust its update semantics slightly,
> then the locking can be much more relaxed.
>
> I'd say that "add" or "append" semantics are likely to be
> at least as common as "edit" semantics.
>
> Can you think of the family of edit semantics you'd like
> to see offered here?
>
> Also, how useful do you think a general-purpose query language
> would be for HBase? It would be fairly straightforward to implement,
> for example, a poor man's version of SQL that has different locking
> and update behavior (and which chucks out the more exotic elements).
> This might be compiled into a piece of code that is executed
> immediately, or it might be transformed into a long-running mapreduce
> job.
>
> I have a few ideas for such a language, but I'm worried it's getting
> a little far afield from what we're interested in for Hadoop.
>
> --Mike
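The versioned-read idea in (2) can be sketched as a small multi-version table: each cell keeps timestamped versions, and a long-running job pins itself to the timestamp at which it was submitted, so it sees a consistent snapshot even while writers keep appending newer versions. All names here are illustrative, not part of any proposed HBase API.

```java
import java.util.Map;
import java.util.TreeMap;

// Each row maps edit timestamps to values; a snapshot read at time T
// returns the newest version written at or before T.
public class VersionedTable {
    // row -> (timestamp -> value); TreeMap keeps versions ordered.
    private final Map<String, TreeMap<Long, String>> rows = new TreeMap<>();

    public void put(String row, long timestamp, String value) {
        rows.computeIfAbsent(row, r -> new TreeMap<>()).put(timestamp, value);
    }

    // Read the row as of the given snapshot timestamp, or null if the
    // row has no version that old.
    public String getAsOf(String row, long asOf) {
        TreeMap<Long, String> versions = rows.get(row);
        if (versions == null) return null;
        Map.Entry<Long, String> e = versions.floorEntry(asOf);
        return e == null ? null : e.getValue();
    }
}
```

A read-only job that captures one timestamp up front and calls `getAsOf` with it throughout never observes concurrent edits, with no locks taken, which is exactly the trade the mail proposes: history storage in exchange for lock-free long reads.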
