If you are opening up the discussion to HDFS, I would really like to think more deeply about why HDFS is a better choice for some workloads than, say, Lustre or GPFS. The things I like about HDFS over Lustre are that 1) it is easier to set up; 2) HDFS by default uses local storage (as opposed to storage attached over the network, which is more typical for Lustre deployments), making data locality for M/R jobs standard; and 3) HDFS lives in the Java world [which could be interpreted as a drawback, I suppose].
Dave

-----Original Message-----
From: Jeff Hammerbacher [mailto:ham...@cloudera.com]
Sent: Tuesday, May 11, 2010 3:29 PM
To: hbase-user@hadoop.apache.org
Subject: Re: Using HBase on other file systems

Hey Edward,

> I do think that if you compare GoogleFS to HDFS, GFS looks more full
> featured.

What features are you missing? Multi-writer append was explicitly called out by Sean Quinlan as a bad idea, and rolled back. From internal conversations with Google engineers, erasure coding of blocks suffered a similar fate. Native client access would certainly be nice, but FUSE gets you most of the way there. Scalability/availability of the NN, RPC QoS, alternative block placement strategies are second-order features which didn't exist in GFS until later in its lifecycle of development as well. HDFS is following a similar path and has JIRA tickets with active discussions. I'd love to hear your feature requests, and I'll be sure to translate them into JIRA tickets.

> I do believe my logic is reasonable. HBase has a lot of code designed around
> HDFS. We know these tickets that get cited all the time, for better random
> reads, or for sync() support. HBase gets the benefits of HDFS and has to
> deal with its drawbacks. Other key value stores handle storage directly.

Sync() works and will be in the next release, and its absence was simply a result of the youth of the system. Now that that limitation has been removed, please point to another place in the code where using HDFS rather than the local file system is forcing HBase to make compromises. Your initial attempts on this front (caching, HFile, compactions) were, I hope, debunked by my previous email. It's also worth noting that Cassandra does all three, despite managing its own storage.

I'm trying to learn from this exchange and always enjoy understanding new systems. Here's what I have so far from your arguments:

1) HBase inherits both the advantages and disadvantages of HDFS.
I clearly agree on the general point; I'm pressing you to name some specific disadvantages, in hopes of helping prioritize our development of HDFS. So far, you've named things which are either a) not actually disadvantages or b) no longer true. If you can come up with the disadvantages, we'll certainly take them into account. I've certainly got a number of them on our roadmap.

2) If you don't want to use HDFS, you won't want to use HBase.

Also certainly true, but I'm not sure there's much to learn from this assertion. I'd once again ask: why would you not want to use HDFS, and what is your choice in its stead?

Thanks,
Jeff