Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change 
notification.

The following page has been changed by JimKellerman:
http://wiki.apache.org/hadoop/HBase/RoadMaps

------------------------------------------------------------------------------
  The definitive source for which issues are targeted for which release is the 
[https://issues.apache.org/jira/browse/HBASE?report=com.atlassian.jira.plugin.system.project:roadmap-panel
 HBase Jira]
  
+ The goal is to release a new major version of HBase within a month of a major 
release of Hadoop. This goal was added due to the long interval between the 
release of Hadoop 0.17 (2008/05/20) and the release of HBase 0.2 (2008/08/08). 
During that period, the only way to run HBase on Hadoop 0.17 was to build it 
from source.
+ 
+ Going forward, new releases of HBase will have the same version number as the 
Hadoop release so there will be less confusion over which release of HBase 
works with which version of Hadoop.
+ 
+ == Road Map ==
+ 
+ New features are planned in approximately six-month windows and are listed in approximate priority order.
+ 
+ === September 2008 - March 2009 ===
+ 
+  * Integrate ZooKeeper
+ 
+  The Bigtable paper, Section 4, describes how Chubby, a distributed lock manager and repository of state, is used as the authority for the list of servers that make up a Bigtable cluster, the location of the root region, and as the repository for table schemas. Currently the HBase master process runs all of the services the Bigtable paper ascribes to Chubby. Instead, we would move these services out of the single-process HBase master to run in a ZooKeeper cluster. ZooKeeper is a Chubby near-clone that, like HBase, is a subproject of Hadoop. Integrating ZooKeeper will make cluster state robust against individual server failures and make for tidier state transitions.
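+ 
+  As a rough illustration only (the class name and znode path below are invented for this sketch, not the actual integration design), keeping the root region location in ZooKeeper rather than in the master might look something like this:
{{{
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs.Ids;
import org.apache.zookeeper.ZooKeeper;

public class RootRegionTracker {
  // Hypothetical znode path; the real layout would be decided during integration.
  private static final String ROOT_ZNODE = "/hbase/root-region-server";
  private final ZooKeeper zk;

  public RootRegionTracker(ZooKeeper zk) {
    this.zk = zk;
  }

  /** The server holding the root region publishes its address in an ephemeral
      znode, so the entry disappears automatically if that server dies. */
  public void publish(String hostAndPort)
      throws KeeperException, InterruptedException {
    zk.create(ROOT_ZNODE, hostAndPort.getBytes(),
        Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
  }

  /** Any client or region server can look the location up directly from
      ZooKeeper instead of asking the single-process master. */
  public String locate() throws KeeperException, InterruptedException {
    return new String(zk.getData(ROOT_ZNODE, false, null));
  }
}
}}}
+  Because the znode is ephemeral, the entry is cleaned up automatically when the holder's session ends, which is part of what makes cluster state robust against individual server failures.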
+ 
+  * Make the HBase Store file format pluggable
+ 
+  Currently HBase is hardcoded to use the Hadoop !MapFile as its base store file type (i.e. the equivalent of the SSTable from the Bigtable paper). Experience has shown the Hadoop !MapFile type to be suboptimal given HBase access patterns. For example, the !MapFile index is ignorant of HBase 'rows'. We would change HBase to instead run against a file interface [https://issues.apache.org/jira/browse/HBASE-61 (HBASE-61 Create an HBase-specific MapFile implementation)]. A configuration option would dictate which file format implementation an HBase instance uses, just as you can swap 'engines' in MySQL.
+ 
+  Once the abstraction work has finished, we would add a new file format, written against this interface, to replace the Hadoop !MapFile. The new file format would be more amenable to HBase I/O patterns. It will either be the TFile specified in the attachment to [https://issues.apache.org/jira/browse/HADOOP-3315 HADOOP-3315 New binary format] or something similar.
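+ 
+  As a rough sketch of the kind of abstraction HBASE-61 is after (the names below are hypothetical, not the interface actually proposed in that issue), the store could be written and read through a small format-neutral interface, with a !MapFile-backed implementation and a TFile-backed one both living behind it:
{{{
import java.io.Closeable;
import java.io.IOException;

/** Hypothetical format-neutral store file abstraction; the configured
    implementation class would supply the concrete reader and writer. */
public interface StoreFileFormat {

  Writer createWriter(String path) throws IOException;

  Reader openReader(String path) throws IOException;

  interface Writer extends Closeable {
    /** Keys must be appended in sorted order, as in an SSTable. */
    void append(byte[] key, byte[] value) throws IOException;
  }

  interface Reader extends Closeable {
    /** Point lookup; a row-aware index would make this cheaper than
        the current MapFile index allows. */
    byte[] get(byte[] key) throws IOException;
  }
}
}}}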
+ 
+  * Redo HBase RPC
+ 
+  Profiling revealed that RPC calls are responsible for a large portion of the latency involved in servicing requests. The HBase RPC is currently the RPC from Hadoop with minor modifications. Hadoop RPC was designed for the passing of occasional messages rather than for bulk data transfers (HDFS uses a different mechanism for bulk data transfer). Among other unwanted attributes, at its core the HBase RPC has a bottleneck such that it handles only a single request at a time. We would replace our RPC with an asynchronous RPC better suited to the type of traffic HBase communication carries.
+ 
+  * Batched Updates
+ 
+  We would add to the HBase client a means of batching updates. Currently updates are sent one at a time. With batching enabled, the HBase client would buffer updates until it hit a threshold or a flush was explicitly invoked. The client would then sort the buffered edits by region and pass them in bulk, concurrently, out to the appropriate region servers. This feature would improve bulk upload performance.
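+ 
+  A rough sketch of the client-side behaviour described above (the class and method names are invented for illustration, not the actual HBase client API): edits accumulate in a buffer, and a flush groups them by region before shipping each group in a single call.
{{{
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BatchingClient {
  private final int flushThreshold;
  private final List<RowUpdate> buffer = new ArrayList<>();

  public BatchingClient(int flushThreshold) {
    this.flushThreshold = flushThreshold;
  }

  /** Buffer an edit; flush automatically once the threshold is reached. */
  public void commit(RowUpdate update) {
    buffer.add(update);
    if (buffer.size() >= flushThreshold) {
      flush();
    }
  }

  /** Sort buffered edits by region, then ship each group in one bulk call
      (the per-region sends could be issued concurrently). */
  public void flush() {
    Map<String, List<RowUpdate>> byRegion = new HashMap<>();
    for (RowUpdate u : buffer) {
      byRegion.computeIfAbsent(regionFor(u.row), k -> new ArrayList<>()).add(u);
    }
    for (Map.Entry<String, List<RowUpdate>> e : byRegion.entrySet()) {
      sendToRegionServer(e.getKey(), e.getValue());
    }
    buffer.clear();
  }

  // Placeholders standing in for real region location and RPC plumbing.
  private String regionFor(byte[] row) { return "region-for-" + new String(row); }
  private void sendToRegionServer(String region, List<RowUpdate> edits) { /* one call per region */ }

  /** A single row edit: the row key plus its column/value pairs. */
  public static class RowUpdate {
    final byte[] row;
    final Map<String, byte[]> columns;
    public RowUpdate(byte[] row, Map<String, byte[]> columns) {
      this.row = row;
      this.columns = columns;
    }
  }
}
}}}
+  An explicit flush() at the end of an upload ensures the last partial batch is also sent.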
+ 
+  * In-memory tables
+ 
+  Implement serving one or all column families of a table from memory.
+ 
+  * Locality Groups
+ 
+  Section 6 of the Bigtable paper describes Locality Groups, a means of grouping column families on the fly. The group can be treated as though it were a single column family. In implementation, all Locality Group members are saved to the same Store in the file system, rather than to a Store per column family as is done in HBase currently. A Locality Group's membership can be widened or narrowed by the administrator as usage patterns evolve, without need of a schema change and the attendant re-upload. At their maximum spread, all families would be part of a single Locality Group; in this configuration, HBase would act like a row-oriented store. At its narrowest, a Locality Group would map to a single column family.
+ 
+  * Data-Locality Awareness
+ 
+  The Hadoop map reduce framework makes a best effort at running tasks on the server hosting the task data, after the dictum that it is cheaper to move the processing to the data than the inverse. HBase needs smarts to assign regions to the region server running on the server that hosts the regions' data, and it needs to supply map reduce hints so that the Hadoop framework runs tasks beside the region server hosting the task input. These changes will make for savings in network I/O.
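+ 
+  For instance (a sketch only; the class below is illustrative, not existing HBase code), the map reduce input split for a region could report the hosting region server's address as its location, which is the hint the Hadoop framework uses when placing tasks:
{{{
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.InputSplit;

/** Hypothetical split covering one region; the row range would be used by the
    record reader, while getLocations() carries the locality hint. */
public class RegionSplit implements InputSplit {
  private String startRow = "";
  private String endRow = "";
  private String regionServerHost = "";

  public RegionSplit() {
    // Required for Writable deserialization.
  }

  public RegionSplit(String startRow, String endRow, String regionServerHost) {
    this.startRow = startRow;
    this.endRow = endRow;
    this.regionServerHost = regionServerHost;
  }

  public long getLength() throws IOException {
    return 0; // Size is not known client-side; the locality hint is what matters.
  }

  public String[] getLocations() throws IOException {
    // The locality hint: schedule this task beside the hosting region server.
    return new String[] { regionServerHost };
  }

  public void write(DataOutput out) throws IOException {
    Text.writeString(out, startRow);
    Text.writeString(out, endRow);
    Text.writeString(out, regionServerHost);
  }

  public void readFields(DataInput in) throws IOException {
    startRow = Text.readString(in);
    endRow = Text.readString(in);
    regionServerHost = Text.readString(in);
  }
}
}}}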
+ 
+  * Secondary indices
+ 
+  The Bigtable primary key is the row key, kept in lexicographically sorted order. Add a means of defining secondary indices on a table.
+ 
+  * Access control
+ 
+  Bigtable can control user access at the column family level. Leverage the 
Hadoop access control mechanism.
+ 
+  * Master fail over
+ 
+  An HBase cluster has a single master. If the master fails, the cluster shuts 
down. Develop a robust master failover.
+ 
+ == Past Releases ==
+ 
+ === 0.2.0 ===
+ 
+ [:Hbase/Plan-0.2: Roadmap for HBase 0.2] 
+ 
+ Overall, much progress was made towards the goal of enhancing robustness and 
scalability. 293 issues were identified and fixed. However, a couple of key 
priorities were not addressed:
+ 
+  * "Too many open file handles" This mostly a Hadoop HDFS problem, although 
some of the pressure can be relieved by changing the HBase RPC.
+ 
+  * Taking advantage of Hadoop append support. Minimal append support did not appear in Hadoop until Hadoop 0.18. While what is there is enough for HBase, this work will probably be pushed into HBase 0.19, as we are already close to the one-month target between the Hadoop and HBase releases.
+ 
+  * Checkpointing/Syncing was deferred until Hadoop support is available.
+ 
