So I have this issue (https://issues.apache.org/jira/browse/HBASE-2387) related to a strategy of building "cloud storage" on top of a secure (i.e. upcoming Y! security mods plus HBASE-1697) HBase+HDFS platform. HBase is really flexible, has some interesting features for this case, and using it avoids the scalability problems HDFS has with a large total number of files.
The part of it I'd like to discuss now is this:

2) Emulate a filesystem, like s3fs (http://code.google.com/p/s3fs/wiki/FuseOverAmazon)

  * Translate paths under the mount point to row keys for good load spreading: /a/b/c/file.ext becomes file.ext/c/b/a.

  * Consider borrowing from Tom White's Hadoop S3 FS (HADOOP-574) and store file data as blocks.

    o After fetching the inode, the client can stream all blocks, e.g. via a Stargate multiget. This would support arbitrary file sizes; otherwise there is a practical limit somewhere around 20-50 MB with default regionserver heaps.

    o So, file.ext/c/b/a gets the inode. Blocks would be keyed by the SHA-1 hash of their contents.

    o Use multiversioning on the inode to get snapshots for free: a path in the filesystem like /a/b/c/file.ext;timestamp gets the file contents on or before that timestamp.

    o Because new writes produce new blocks with unique hashes, this behaves like a dedup filesystem. Use ICV to maintain use counters on blocks.

  * Stargate multiget and multiput operations can help performance. I don't think Thrift has a comparable multi-op capability.

Multiversioning on the "inode" and reference counters on blocks wouldn't magically work together, so snapshot and dedup wouldn't coordinate without extra work. We could leave data blocks in place forever; then deep multiversioning of inodes wouldn't encounter any trouble. Or, the client side could run scanners over all inodes before a (file- or snapshot-level) delete and then use ICVs to determine which data blocks, if any, to issue deletes for. If this were a hosted service, the client wouldn't be the process issuing the deletes, of course.

However, this does raise an interesting and perhaps crazy thought. First, what about attaching metadata (itself KVs) to KVs in the store, in a way that makes it efficient to look up the metadata for a given KV or set of KVs? Second, what about the notion of references? For the case above specifically: metadata on an "inode" KV that consists of a list of pointers to other KVs.
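The key scheme above -- reversed path components for load spreading, plus content-hash keys for dedup'd blocks -- can be sketched as follows. This is a minimal standalone sketch; the class and method names are my own for illustration, not part of any HBase client API:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class PathKeys {
    // Reverse path components so /a/b/c/file.ext -> file.ext/c/b/a.
    // Lexicographically adjacent paths under one directory then land
    // on well-spread row keys instead of hotspotting one region.
    static String toRowKey(String path) {
        List<String> parts = new ArrayList<>(
            Arrays.asList(path.replaceAll("^/+", "").split("/")));
        Collections.reverse(parts);
        return String.join("/", parts);
    }

    // Key a data block by the SHA-1 hash of its contents, so identical
    // blocks written by different files share one row (dedup).
    static String blockKey(byte[] block) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-1").digest(block);
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) sb.append(String.format("%02x", b));
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException(e); // SHA-1 is always available
        }
    }

    public static void main(String[] args) {
        System.out.println(toRowKey("/a/b/c/file.ext")); // file.ext/c/b/a
        System.out.println(blockKey("hello".getBytes(StandardCharsets.UTF_8)));
    }
}
```

The inode row at file.ext/c/b/a would then hold the ordered list of block keys, which the client resolves with a multiget.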
When deleting the "inode" KV -- one that fell off the tail of a stack of versions -- at compaction time, the store could follow the pointers and delete the referenced values as well. Or better: decrement a specified Long-encoded KV, and then take the delete action on another specified KV (or set of KVs) if the result is <= 0.

So, just to be clear, I'm not advocating building my use case into HBase -- it is a motivating example -- but rather suggesting there are perhaps some interesting generic primitives to consider here. They could support mechanisms for referential integrity that people coming from RDBMS are quite familiar with.

Just thought I'd throw this out there,

   - Andy
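The decrement-then-delete primitive might behave something like this in-memory sketch, with plain Java maps standing in for the block table and its ICV-maintained counters (all names here are hypothetical, not a proposed API):

```java
import java.util.HashMap;
import java.util.Map;

public class RefCountedStore {
    // Stand-ins for the block table and its use counters; in HBase
    // these would be column values and ICV-maintained counter KVs.
    final Map<String, byte[]> blocks = new HashMap<>();
    final Map<String, Long> refCounts = new HashMap<>();

    // A new file reference to a block: store it once, bump its counter.
    void putBlock(String key, byte[] data) {
        blocks.putIfAbsent(key, data);
        refCounts.merge(key, 1L, Long::sum); // ICV-style increment
    }

    // The proposed primitive: decrement the specified counter KV,
    // and if the result is <= 0, delete the referenced data KV too.
    void decrementAndMaybeDelete(String key) {
        long remaining = refCounts.merge(key, -1L, Long::sum);
        if (remaining <= 0) {
            refCounts.remove(key);
            blocks.remove(key);
        }
    }
}
```

For example, if two files share one dedup'd block, deleting the first file decrements the counter to 1 and the block survives; deleting the second drops it to 0 and the block is reclaimed. Run server-side at compaction time, this would keep dedup and snapshot deletes consistent without client-side scans.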