[Lucene-hadoop Wiki] Update of "Hbase/HbaseArchitecture" by JimKellerman

Apache Wiki Sat, 30 Jun 2007 10:00:49 -0700

The following page has been changed by JimKellerman:
http://wiki.apache.org/lucene-hadoop/Hbase/HbaseArchitecture

------------------------------------------------------------------------------
  [[Anchor(status)]]
  = Current Status =
  
- As of this writing (2007/05/30), there are approximately 11,500 lines of code in 
+ As of this writing (2007/06/30), there are approximately 16,500 lines of code in 
  the "src/contrib/hbase/src/java/org/apache/hadoop/hbase/" directory on the Hadoop SVN trunk.
  
- There are also about 2800 lines of test cases.
+ There are also about 4000 lines of test cases.
  
  All of the single-machine operations (safe-committing, merging,
  splitting, versioning, flushing, compacting, log-recovery) are
  complete, have been tested, and seem to work great.
  
  The multi-machine components (the HMaster, the H!RegionServer, and the
- HClient) are in the process of being debugged.
+ HClient) are actively being enhanced and debugged.
  
  Other related features and TODOs:
-  1. We need easy interfaces to !MapReduce jobs, so they can scan tables. We have been contacted by Vuk Ercegovac [[MailTo(vercego AT SPAMFREE us DOT ibm DOT com)]] of IBM Almaden Research who expressed an interest in working on an HBase interface to Hadoop map/reduce.
-  1. Vuk Ercegovac also pointed out that keeping HBase HRegion edit logs in HDFS is currently flawed.  HBase writes edits to logs and to a memcache.  The 'atomic' write to the log is meant to serve as insurance against abnormal !RegionServer exit: on startup, the log is rerun to reconstruct an HRegion's last wholesome state. But files in HDFS do not 'exist' until they are cleanly closed -- something that will not happen if !RegionServer exits without running its 'close'.
+  1. Vuk Ercegovac [[MailTo(vercego AT SPAMFREE us DOT ibm DOT com)]] of IBM Almaden Research pointed out that keeping HBase HRegion edit logs in HDFS is currently flawed.  HBase writes edits to logs and to a memcache.  The 'atomic' write to the log is meant to serve as insurance against abnormal !RegionServer exit: on startup, the log is rerun to reconstruct an HRegion's last wholesome state. But files in HDFS do not 'exist' until they are cleanly closed -- something that will not happen if !RegionServer exits without running its 'close'.
   1. The HMemcache lookup structure is relatively inefficient.
   1. File compaction is relatively slow; we should have a more conservative algorithm for deciding when to apply compaction.  Same for region splits.
   1. For the getFull() operation, use of Bloom filters would speed things up (see [https://issues.apache.org/jira/browse/HADOOP-1415 HADOOP-1415]).
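
The edit-log concern in the list above can be illustrated with a minimal Java sketch. This is not HBase code: the class EditLogSketch, its methods, and the in-memory byte buffer standing in for an HDFS file are all hypothetical. The rule that a file is invisible until it is cleanly closed is simulated from the description above, not taken from the HDFS API.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.TreeMap;

// Hypothetical sketch; none of these names come from the HBase source.
class EditLogSketch {
    // Stand-in for the in-memory memcache of recent edits.
    final TreeMap<String, String> memcache = new TreeMap<>();
    // Stand-in for the edit log file; a real log would be a file in HDFS.
    private final ByteArrayOutputStream logBytes = new ByteArrayOutputStream();
    private final DataOutputStream log = new DataOutputStream(logBytes);

    // Record the edit in the log first, then apply it to the memcache.
    void put(String row, String value) {
        try {
            log.writeUTF(row);
            log.writeUTF(value);
            log.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        memcache.put(row, value);
    }

    // Simulated assumption: the log is readable only if it was closed cleanly.
    byte[] visibleLog(boolean closedCleanly) {
        return closedCleanly ? logBytes.toByteArray() : new byte[0];
    }

    // Rerun the log on startup to rebuild the memcache contents.
    static TreeMap<String, String> replay(byte[] visibleLog) {
        TreeMap<String, String> rebuilt = new TreeMap<>();
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(visibleLog));
        try {
            while (true) {
                String row = in.readUTF();
                String value = in.readUTF();
                rebuilt.put(row, value);
            }
        } catch (EOFException endOfLog) {
            // End of log, or a torn final record: keep what was fully read.
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return rebuilt;
    }

    public static void main(String[] args) {
        EditLogSketch region = new EditLogSketch();
        region.put("row1", "a");
        region.put("row2", "b");
        System.out.println("clean close:   " + replay(region.visibleLog(true)));
        System.out.println("abnormal exit: " + replay(region.visibleLog(false)));
    }
}
```

When the simulated log is closed cleanly, replay rebuilds the same entries the memcache held. After a simulated abnormal exit the visible log is empty, so replay recovers nothing, which is the flaw described in the list.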
