Sean Mackrory created HBASE-22149:
-------------------------------------
Summary: HBOSS: A FileSystem implementation to provide HBase's
required semantics
Key: HBASE-22149
URL: https://issues.apache.org/jira/browse/HBASE-22149
Project: HBase
Issue Type: Bug
Reporter: Sean Mackrory
Assignee: Sean Mackrory
I've had some thoughts about how to solve the problem of running HBase on
object stores. There has been some thought in the past about adding the
required semantics to S3Guard, but I have some concerns about that. First, it's
mixing complicated solutions to different problems (bridging the gap between a
flat namespace and a hierarchical namespace vs. solving inconsistency). Second,
it's S3-specific, whereas other object stores could use virtually identical
solutions. And third, we can't do things like atomic renames in a true sense.
There would have to be some trade-offs and it's better if we can solve that in
an HBase-specific module without mixing all that logic in with the rest of S3A.
Ideas to solve this above the FileSystem layer have been proposed and
considered (HBASE-20431, for one), and maybe that's the right way forward
long-term, but it certainly seems to be a hard problem and hasn't been done
yet. But I don't know enough of all the internal considerations to make much of
a judgment on that myself.
I propose a FileSystem implementation that wraps another FileSystem instance
and provides locking of FileSystem operations to ensure correct semantics.
Locking could quite possibly be done on the same ZooKeeper ensemble as an HBase
cluster already uses (I'm sure there are some performance considerations here
that deserve more attention). I've put together a proof-of-concept, with which
I've tested some aspects of atomic renames and atomic file creates. Both of
these tests fail on a naked s3a instance. I've also done a small YCSB run
against a small cluster to sanity-check other functionality, and it succeeded.
I will post the patch, and my laundry list of things that still need work. The
WAL is still placed on HDFS, but the HBase root directory is otherwise on S3.
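To make the wrapping idea concrete, here is a minimal in-memory sketch in Java. It is not the proof-of-concept patch. FlatStore and LockingStore are hypothetical names, the flat key/value map stands in for an object store such as S3, and a single in-process ReentrantReadWriteLock stands in for the ZooKeeper-based locking described above. A real implementation would presumably wrap Hadoop's FileSystem and use finer-grained, distributed locks.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical stand-in for an object store: flat keys, no atomic rename. */
class FlatStore {
    final TreeMap<String, String> objects = new TreeMap<>();

    /** True if key is the directory itself or lies underneath it. */
    static boolean isUnder(String key, String dir) {
        return key.equals(dir) || key.startsWith(dir + "/");
    }

    /** Non-atomic "rename": each object is moved to its new key one at a time. */
    void renamePrefix(String src, String dst) {
        for (String key : new ArrayList<>(objects.keySet())) {
            if (isUnder(key, src)) {
                objects.put(dst + key.substring(src.length()), objects.remove(key));
            }
        }
    }
}

/** Hypothetical wrapper that serializes operations so they appear atomic. */
class LockingStore {
    private final FlatStore delegate;
    // Assumption: one coarse in-process lock in place of distributed locks.
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    LockingStore(FlatStore delegate) {
        this.delegate = delegate;
    }

    /** Atomic create: returns false if the path already exists. */
    boolean createIfAbsent(String path, String data) {
        lock.writeLock().lock();
        try {
            if (delegate.objects.containsKey(path)) {
                return false;
            }
            delegate.objects.put(path, data);
            return true;
        } finally {
            lock.writeLock().unlock();
        }
    }

    /** Rename under the write lock, so readers never see a half-moved tree. */
    void rename(String src, String dst) {
        lock.writeLock().lock();
        try {
            delegate.renamePrefix(src, dst);
        } finally {
            lock.writeLock().unlock();
        }
    }

    /** Listing takes the read lock, so it cannot overlap a rename. */
    List<String> list(String dir) {
        lock.readLock().lock();
        try {
            List<String> out = new ArrayList<>();
            for (String key : delegate.objects.keySet()) {
                if (FlatStore.isUnder(key, dir)) {
                    out.add(key);
                }
            }
            return out;
        } finally {
            lock.readLock().unlock();
        }
    }
}
```

The rename is still a sequence of per-object moves underneath. The lock only guarantees that clients going through the wrapper never observe the intermediate state, which is the kind of trade-off mentioned above.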
Note that my prototype is built on Hadoop's source tree right now. That's
purely for my convenience in putting it together quickly, as that's where I
mostly work. I actually think long-term, if this is accepted as a good
solution, it makes sense to live in HBase (or its own repository). It only
depends on stable, public APIs in Hadoop and is targeted entirely at HBase's
needs, so it should be able to iterate on the HBase community's terms alone.
Another idea [[email protected]] proposed to me is that of an inode-based
FileSystem that keeps hierarchical metadata in a more appropriate store that
would allow the required transactions (maybe a special table in HBase could
provide that store itself for other tables), and stores the underlying files
with unique identifiers on S3. This allows renames to actually become fast
instead of just large atomic operations. It does however place a strong
dependency on the metadata store. I have not explored this idea much. My
current proof-of-concept has been pleasantly simple, so I think it's the right
solution unless it proves unable to provide the required performance
characteristics.
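For comparison, the inode-based idea could look roughly like the following Java sketch. This is an illustration under assumptions, not something that has been built: InodeFs is a hypothetical name, the in-memory inode tree stands in for the transactional metadata store, and the blobs map stands in for S3 objects stored under unique identifiers. Error handling (for example, renaming a directory into its own subtree) is omitted.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical inode-style namespace kept separately from the stored objects. */
class InodeFs {
    private static final class Inode {
        final Map<String, Inode> children = new HashMap<>(); // used by directories
        String blobId;                                       // non-null for files
    }

    private final Inode root = new Inode();
    // Stand-in for the object store: unique id -> file contents.
    private final Map<String, String> blobs = new HashMap<>();
    private long nextId = 0;
    int blobWrites = 0; // counts writes to the "object store"

    private static String[] parts(String path) {
        return path.substring(1).split("/");
    }

    /** Walks to the parent directory of the last path component. */
    private Inode parentOf(String[] p, boolean create) {
        Inode cur = root;
        for (int i = 0; i < p.length - 1; i++) {
            Inode next = cur.children.get(p[i]);
            if (next == null) {
                if (!create) {
                    return null;
                }
                next = new Inode();
                cur.children.put(p[i], next);
            }
            cur = next;
        }
        return cur;
    }

    synchronized void write(String path, String data) {
        String[] p = parts(path);
        Inode file = new Inode();
        file.blobId = "blob-" + (nextId++);
        blobs.put(file.blobId, data);
        blobWrites++;
        Inode old = parentOf(p, true).children.put(p[p.length - 1], file);
        if (old != null && old.blobId != null) {
            blobs.remove(old.blobId); // drop the overwritten object
        }
    }

    synchronized String read(String path) {
        String[] p = parts(path);
        Inode parent = parentOf(p, false);
        Inode node = parent == null ? null : parent.children.get(p[p.length - 1]);
        return (node == null || node.blobId == null) ? null : blobs.get(node.blobId);
    }

    /** Rename re-parents a single inode; no stored object is copied or moved. */
    synchronized boolean rename(String src, String dst) {
        String[] s = parts(src);
        String[] d = parts(dst);
        Inode srcParent = parentOf(s, false);
        if (srcParent == null || !srcParent.children.containsKey(s[s.length - 1])) {
            return false;
        }
        Inode dstParent = parentOf(d, true);
        if (dstParent.children.containsKey(d[d.length - 1])) {
            return false;
        }
        dstParent.children.put(d[d.length - 1], srcParent.children.remove(s[s.length - 1]));
        return true;
    }
}
```

In this sketch a directory rename touches only the metadata tree, regardless of how many files sit under the directory. Every read and write does go through that tree, which is the strong dependency on the metadata store noted above.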