Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change 
notification.

The following page has been changed by JimKellerman:
http://wiki.apache.org/hadoop/HBase/RoadMaps

------------------------------------------------------------------------------
  The definitive source for which issues are targeted for which release is the 
[https://issues.apache.org/jira/browse/HBASE?report=com.atlassian.jira.plugin.system.project:roadmap-panel
 HBase Jira]
  
+ The goal is to release a new major version of HBase within a month of a major 
release of Hadoop. This goal was added due to the long interval between the 
release of Hadoop 0.17 (2008/05/20) and the release of HBase 0.2 (2008/08/08). 
During that period, the only way to run HBase on Hadoop 0.17 was to build it 
from source.
+ 
+ Going forward, new releases of HBase will have the same version number as the 
Hadoop release so there will be less confusion over which release of HBase 
works with which version of Hadoop.
+ 
+ == Road Map ==
+ 
+ New features are planned in approximately six-month windows and are listed in approximate priority order.
+ 
+ === September 2008 - March 2009 ===
+ 
+  * Integrate ZooKeeper
+ 
+  The Bigtable paper, Section 4, describes how Chubby, a distributed lock manager and repository of state, is used as the authority for the list of servers that make up a Bigtable cluster, the location of the root region, and as the repository for table schemas. Currently the HBase master process runs all of the services the Bigtable paper ascribes to Chubby. Instead, we would move these services out of the single-process HBase master to run in a ZooKeeper cluster. ZooKeeper is a Chubby near-clone that, like HBase, is a subproject of Hadoop. Integrating ZooKeeper will make cluster state robust against individual server failures and make for tidier state transitions.
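+ 
+  As a rough illustration only (the class name and znode path below are invented for this sketch, not the actual integration design), keeping the root region location in ZooKeeper rather than in the master might look something like this:
{{{
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs.Ids;
import org.apache.zookeeper.ZooKeeper;

public class RootRegionTracker {
  // Hypothetical znode path; the real layout would be decided during integration.
  private static final String ROOT_ZNODE = "/hbase/root-region-server";
  private final ZooKeeper zk;

  public RootRegionTracker(ZooKeeper zk) {
    this.zk = zk;
  }

  /** The server holding the root region publishes its address in an ephemeral
      znode, so the entry disappears automatically if that server dies. */
  public void publish(String hostAndPort)
      throws KeeperException, InterruptedException {
    zk.create(ROOT_ZNODE, hostAndPort.getBytes(),
        Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
  }

  /** Any client or region server can look the location up directly from
      ZooKeeper instead of asking the single-process master. */
  public String locate() throws KeeperException, InterruptedException {
    return new String(zk.getData(ROOT_ZNODE, false, null));
  }
}
}}}
+  Because the znode is ephemeral, the entry is cleaned up automatically when the holder's session ends, which is part of what makes cluster state robust against individual server failures.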
+ 
+  * Make the HBase Store file format pluggable
+ 
+  Currently HBase is hardcoded to use the Hadoop !MapFile as its base store file type (i.e. the equivalent of the SSTable from the Bigtable paper). Experience has shown the Hadoop !MapFile type to be suboptimal given HBase access patterns. For example, the !MapFile index is ignorant of HBase 'rows'. We would change HBase to instead run against a file interface [https://issues.apache.org/jira/browse/HBASE-61 (HBASE-61 Create an HBase-specific MapFile implementation)]. A configuration option would dictate which file format implementation an HBase instance uses, just as you can swap 'engines' in MySQL.
+ 
+  Once the abstraction work has finished, we would add a new file format, written against this interface, to replace the Hadoop !MapFile. The new file format would be more amenable to HBase I/O patterns. It will either be the TFile specified in the attachment to [https://issues.apache.org/jira/browse/HADOOP-3315 HADOOP-3315 New binary format] or something similar.
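+ 
+  As a rough sketch of the kind of abstraction HBASE-61 is after (the names below are hypothetical, not the interface actually proposed in that issue), the store could be written and read through a small format-neutral interface, with a !MapFile-backed implementation and a TFile-backed one both living behind it:
{{{
import java.io.Closeable;
import java.io.IOException;

/** Hypothetical format-neutral store file abstraction; the configured
    implementation class would supply the concrete reader and writer. */
public interface StoreFileFormat {

  Writer createWriter(String path) throws IOException;

  Reader openReader(String path) throws IOException;

  interface Writer extends Closeable {
    /** Keys must be appended in sorted order, as in an SSTable. */
    void append(byte[] key, byte[] value) throws IOException;
  }

  interface Reader extends Closeable {
    /** Point lookup; a row-aware index would make this cheaper than
        the current MapFile index allows. */
    byte[] get(byte[] key) throws IOException;
  }
}
}}}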
+ 
+  * Redo HBase RPC
+ 
+  Profiling revealed that RPC calls are responsible for a large portion of the latency involved in servicing requests. The HBase RPC is currently the RPC from Hadoop with minor modifications. Hadoop RPC was designed for the passing of occasional messages rather than for bulk data transfers (HDFS uses a different mechanism for bulk data transfer). Among other unwanted attributes, at its core the HBase RPC has a bottleneck such that it handles only a single request at a time. We would replace our RPC with an asynchronous RPC better suited to the type of traffic HBase communication carries.
+ 
+  * Batched Updates
+ 
+  We would add to the HBase client a means of batching updates. Currently updates are sent one at a time. With batching enabled, the HBase client would buffer updates until it hit a threshold or a flush was explicitly invoked. The client would then sort the buffered edits by region and pass them in bulk, concurrently, out to the appropriate region servers. This feature would improve bulk upload performance.
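+ 
+  A rough sketch of the client-side behaviour described above (the class and method names are invented for illustration, not the actual HBase client API): edits accumulate in a buffer, and a flush groups them by region before shipping each group in a single call.
{{{
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BatchingClient {
  private final int flushThreshold;
  private final List<RowUpdate> buffer = new ArrayList<>();

  public BatchingClient(int flushThreshold) {
    this.flushThreshold = flushThreshold;
  }

  /** Buffer an edit; flush automatically once the threshold is reached. */
  public void commit(RowUpdate update) {
    buffer.add(update);
    if (buffer.size() >= flushThreshold) {
      flush();
    }
  }

  /** Sort buffered edits by region, then ship each group in one bulk call
      (the per-region sends could be issued concurrently). */
  public void flush() {
    Map<String, List<RowUpdate>> byRegion = new HashMap<>();
    for (RowUpdate u : buffer) {
      byRegion.computeIfAbsent(regionFor(u.row), k -> new ArrayList<>()).add(u);
    }
    for (Map.Entry<String, List<RowUpdate>> e : byRegion.entrySet()) {
      sendToRegionServer(e.getKey(), e.getValue());
    }
    buffer.clear();
  }

  // Placeholders standing in for real region location and RPC plumbing.
  private String regionFor(byte[] row) { return "region-for-" + new String(row); }
  private void sendToRegionServer(String region, List<RowUpdate> edits) { /* one call per region */ }

  /** A single row edit: the row key plus its column/value pairs. */
  public static class RowUpdate {
    final byte[] row;
    final Map<String, byte[]> columns;
    public RowUpdate(byte[] row, Map<String, byte[]> columns) {
      this.row = row;
      this.columns = columns;
    }
  }
}
}}}
+  An explicit flush() at the end of an upload ensures the last partial batch is also sent.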
+ 
+  * In-memory tables
+ 
+  Implement serving one or all column families of a table from memory.
+ 
+  * Locality Groups
+ 
+  Section 6 of the Bigtable paper describes Locality Groups, a means of grouping column families on the fly. The group can be treated as though it were a single column family. In implementation, all Locality Group members are saved to the same Store in the file system, rather than to a Store per column family as is done in HBase currently. A Locality Group's membership can be widened or narrowed by the administrator as usage patterns evolve, without need of a schema change and the attendant re-upload. At their maximum spread, all families would be part of a single Locality Group; in this configuration, HBase would act like a row-oriented store. At its narrowest, a Locality Group would map to a single column family.
+ 
+  * Data-Locality Awareness
+ 
+  The Hadoop map reduce framework makes a best effort at running tasks on the server hosting the task data, after the dictum that it is cheaper to move the processing to the data than the inverse. HBase needs smarts to assign regions to the region server running on the server that hosts the regions' data, and it needs to supply map reduce hints so that the Hadoop framework runs tasks beside the region server hosting the task input. These changes will make for savings in network I/O.
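+ 
+  For instance (a sketch only; the class below is illustrative, not existing HBase code), the map reduce input split for a region could report the hosting region server's address as its location, which is the hint the Hadoop framework uses when placing tasks:
{{{
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.InputSplit;

/** Hypothetical split covering one region; the row range would be used by the
    record reader, while getLocations() carries the locality hint. */
public class RegionSplit implements InputSplit {
  private String startRow = "";
  private String endRow = "";
  private String regionServerHost = "";

  public RegionSplit() {
    // Required for Writable deserialization.
  }

  public RegionSplit(String startRow, String endRow, String regionServerHost) {
    this.startRow = startRow;
    this.endRow = endRow;
    this.regionServerHost = regionServerHost;
  }

  public long getLength() throws IOException {
    return 0; // Size is not known client-side; the locality hint is what matters.
  }

  public String[] getLocations() throws IOException {
    // The locality hint: schedule this task beside the hosting region server.
    return new String[] { regionServerHost };
  }

  public void write(DataOutput out) throws IOException {
    Text.writeString(out, startRow);
    Text.writeString(out, endRow);
    Text.writeString(out, regionServerHost);
  }

  public void readFields(DataInput in) throws IOException {
    startRow = Text.readString(in);
    endRow = Text.readString(in);
    regionServerHost = Text.readString(in);
  }
}
}}}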
+ 
+  * Secondary indices
+ 
+  The Bigtable primary key is the row key, kept in lexicographically sorted order. Add a means of defining secondary indices on a table.
+ 
+  * Access control
+ 
+  Bigtable can control user access at the column family level. Leverage the 
Hadoop access control mechanism.
+ 
+  * Master fail over
+ 
+  An HBase cluster has a single master. If the master fails, the cluster shuts 
down. Develop a robust master failover.
+ 
+ == Past Releases ==
+ 
+ === 0.2.0 ===
+ 
+ [:Hbase/Plan-0.2: Roadmap for HBase 0.2] 
+ 
+ Overall, much progress was made towards the goal of enhancing robustness and 
scalability. 293 issues were identified and fixed. However, a couple of key 
priorities were not addressed:
+ 
+  * "Too many open file handles" This mostly a Hadoop HDFS problem, although 
some of the pressure can be relieved by changing the HBase RPC.
+ 
+  * Taking advantage of Hadoop append support. Minimal append support did not appear in Hadoop until Hadoop 0.18. While what is there is enough for HBase, this work will probably be pushed into HBase 0.19, as we are already close to the one-month target between the Hadoop and HBase releases.
+ 
+  * Checkpointing/Syncing was deferred until Hadoop support is available.
+ 
