So I have this issue (https://issues.apache.org/jira/browse/HBASE-2387) related 
to a strategy of building "cloud storage" on top of a secure (i.e. upcoming Y! 
security mods plus HBASE-1697) HBase+HDFS platform. HBase is really flexible, 
has some interesting features for this case, and using it avoids the 
scalability problems that come with a large total number of files in HDFS. 

The part of it I'd like to discuss now is this:

2) Emulate a filesystem, like s3fs 
(http://code.google.com/p/s3fs/wiki/FuseOverAmazon)
    * Translate paths under the mount point to row keys for good load 
spreading, /a/b/c/file.ext becomes file.ext/c/b/a
    * Consider borrowing from Tom White's Hadoop S3 FS (HADOOP-574), and store 
file data as blocks.
          o After fetching the inode, the client can stream all blocks, e.g. 
via a Stargate multiget. This would support arbitrary file sizes. Otherwise 
there is a practical limit somewhere around 20-50 MB with default regionserver 
heaps.
          o So, file.ext/c/b/a gets the inode. Blocks would be keyed using the 
SHA-1 hash of their contents.
          o Use multiversioning on the inode to get snapshots for free: A path 
in the filesystem like /a/b/c/file.ext;timestamp gets file contents on or 
before timestamp.
          o Because new writes produce new blocks with unique hashes, this 
behaves like a dedup filesystem. Use ICV (incrementColumnValue) to maintain 
use counters on blocks.
    * Stargate multiget and multiput operations can help performance. I don't 
think Thrift has a comparable multi-op capability.
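The key scheme above can be sketched in a few lines. This is a toy Python 
illustration of the translation logic only, not HBase client code; the 
function names and the 1 MB default block size are my own for illustration:

```python
import hashlib

def path_to_rowkey(path):
    """Reverse path components: /a/b/c/file.ext becomes file.ext/c/b/a.

    Keying on the reversed path spreads sibling files across regions
    instead of clustering a whole directory on one regionserver.
    """
    parts = [p for p in path.split("/") if p]
    return "/".join(reversed(parts))

def block_key(data):
    """Key a data block by the SHA-1 hash of its contents.

    Content addressing means identical blocks share one key, which is
    what makes the dedup property fall out for free.
    """
    return hashlib.sha1(data).hexdigest()

def split_blocks(data, block_size=1 << 20):
    """Split file contents into fixed-size blocks, in the spirit of
    Tom White's S3 FS (HADOOP-574)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]
```

So a put of /a/b/c/file.ext would write the inode under row 
file.ext/c/b/a and one row per block under its SHA-1 key.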

Multiversioning on the "inode" and reference counters on blocks won't 
coordinate by themselves, so snapshots and dedup would need some work to 
cooperate. We could leave data blocks in place forever; then deep 
multiversioning of inodes wouldn't encounter any trouble. Or, the client side 
can run scanners over all inodes before a (file- or snapshot-level) delete and 
then use ICVs to determine which data blocks, if any, to issue deletes for. If 
this is a hosted service, then of course the client wouldn't be the process 
issuing the deletes.
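The client-side bookkeeping would look roughly like this. A toy Python sketch 
with an in-memory dict standing in for the counters HBase would hold; in the 
real scheme the decrement would be an ICV call and the delete a Delete against 
the block row (the names here are mine, purely illustrative):

```python
# Stand-in for the per-block use counters kept in HBase.
block_refcounts = {}

def reference_block(key):
    """ICV +1 on a block's use counter when a new inode version references it."""
    block_refcounts[key] = block_refcounts.get(key, 0) + 1

def release_block(key, delete_fn):
    """ICV -1 on the counter; only when no references remain does the
    client actually issue a delete for the block row."""
    block_refcounts[key] = block_refcounts.get(key, 0) - 1
    if block_refcounts[key] <= 0:
        del block_refcounts[key]
        delete_fn(key)
```

Two inode versions sharing a block would bump its counter to 2, and deleting 
one version decrements it without touching the shared data.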

However, this does raise an interesting and perhaps crazy thought. 

First, what about attaching metadata (itself KVs) to KVs in the store, such 
that it is efficient to look up the metadata for a given KV or set of KVs?

Second, what about the notion of references? For the case above specifically, 
metadata on an "inode" KV that consists of a list of pointers to other KVs. 
When deleting the "inode" KV -- one that fell off the tail of a stack of 
versions -- at compaction time, the store could follow the pointers and 
delete the referenced values also. Or better, decrement a specified 
Long-encoded KV and then take the delete action on another specified KV (or 
set of KVs) if the result is <= 0. 

So just to be clear, I'm not advocating building my use case into HBase -- it 
is a motivating example -- but rather there are perhaps some interesting 
generic primitives to consider here. They could support mechanisms for 
referential integrity that people coming from an RDBMS background are quite 
familiar with. 

Just thought I'd throw this out there, 

   - Andy