If you are opening up the discussion to HDFS, I would really like to think more 
deeply about why HDFS is a better choice for some workloads than, say, Lustre 
or GPFS.
The things I like about HDFS over Lustre are:
1) it is easier to set up 
2) HDFS by default uses local storage (as opposed to the storage area networks 
that are more typical for Lustre deployments), making data locality for M/R jobs 
standard
3) HDFS lives in the Java world [which could be interpreted as a drawback I 
suppose]

Dave


-----Original Message-----
From: Jeff Hammerbacher [mailto:ham...@cloudera.com] 
Sent: Tuesday, May 11, 2010 3:29 PM
To: hbase-user@hadoop.apache.org
Subject: Re: Using HBase on other file systems

Hey Edward,

> I do think that if you compare GoogleFS to HDFS, GFS looks more full
> featured.
>

What features are you missing? Multi-writer append was explicitly called out
by Sean Quinlan as a bad idea, and rolled back. From internal conversations
with Google engineers, erasure coding of blocks suffered a similar fate.
Native client access would certainly be nice, but FUSE gets you most of the
way there. Scalability/availability of the NN, RPC QoS, and alternative block
placement strategies are second-order features which didn't exist in GFS
until later in its development lifecycle either. HDFS is following a
similar path and has JIRA tickets with active discussions. I'd love to hear
your feature requests, and I'll be sure to translate them into JIRA tickets.

> I do believe my logic is reasonable. HBase has a lot of code designed around
> HDFS.  We know these tickets that get cited all the time, for better random
> reads, or for sync() support. HBase gets the benefits of HDFS and has to
> deal with its drawbacks. Other key value stores handle storage directly.
>

Sync() works and will be in the next release, and its absence was simply a
result of the youth of the system. Now that that limitation has been
removed, please point to another place in the code where using HDFS rather
than the local file system is forcing HBase to make compromises. Your
initial attempts on this front (caching, HFile, compactions) were, I hope,
debunked by my previous email. It's also worth noting that Cassandra does
all three, despite managing its own storage.

I'm trying to learn from this exchange and always enjoy understanding new
systems. Here's what I have so far from your arguments:
1) HBase inherits both the advantages and disadvantages of HDFS. I clearly
agree on the general point; I'm pressing you to name some specific
disadvantages, in hopes of helping prioritize our development of HDFS. So
far, you've named things which are either a) not actually disadvantages or b)
no longer true. If you can come up with the disadvantages, we'll certainly
take them into account. I've certainly got a number of them on our roadmap.
2) If you don't want to use HDFS, you won't want to use HBase. Also
certainly true, but I'm not sure there's much to learn from this
assertion. I'd once again ask: why would you not want to use HDFS, and what
is your choice in its stead?

Thanks,
Jeff
