[ 
https://issues.apache.org/jira/browse/HADOOP-5189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luca Telloli updated HADOOP-5189:
---------------------------------

    Release Note: This patch breaks the file-based logging system.  
    Hadoop Flags: [Incompatible change]
          Status: Patch Available  (was: Open)

Integrating Bookkeeper into HDFS 
--------------------------------

INTRO
------
BookKeeper is a system to reliably log streams of records 
(https://issues.apache.org/jira/browse/ZOOKEEPER-276). The NameNode is a 
natural target for such a system for being the metadata repository of the 
entire file system for HDFS. 

The NameNode works by logging any modification to the file system on a separate 
stream and periodically merge the stream with the latest image of the file 
system. The standard version of HDFS only supports file-based logging on any 
number of devices. This patch provides BookKeeper logging. 

The advantages of BookKeeper-logging over file-logging are: 
        * higher throughput performance in large deployments 
        * higher availability though externally recoverable log files (ledgers)

This patch provides: 
        * a new abstraction for logging (also filed as 
https://issues.apache.org/jira/browse/HADOOP-5188)
        * an integration between Hadoop's HDFS and BookKeeper

HOW TO APPLY THE PATCH 
-------------------------
The patch is available against Hadoop release 0.19 
(http://svn.apache.org/repos/asf/hadoop/core/tags/release-0.19.0/). This patch 
includes https://issues.apache.org/jira/browse/HADOOP-5188 (Modifications to 
enable multiple types of logging)

The patch does not support file-based logging, nor it offers a way to switch 
from one system to the other, so please apply with caution. The patch does not 
support HDFS deployment older than 0.19 and has been tested *only* over 
pre-formatted file-systems; although it should work without many troubles, we 
suggest not to apply this patch to production systems.

To compile, just run ant. To compile and run properly you will need to add the 
zookeeper and bookkeeper jar to you hadoop/lib directory. Please refer to 
http://hadoop.apache.org/zookeeper/ for information on how to get, compile and 
run Zookeeper and Bookkeeper. 

HOW TO USE THE PATCHED HDFS 
----------------------------
One you patch Hadoop there will be some new properties to configure in 
hadoop-site.xml; a description follows: 

<property>
<!-- 
Specifies the type of logging to use. At the moment, only BKEditLogThreadBuf is 
provided.   
-->
<name>hdfs.editlog</name>
<value>BKEditLogThreadBuf</value>
</property> 

<property>
<!-- 
Specifies the number of bookies to use. Default is 3.    
-->
<name>bklog.bookies.total</name>
<value>3</value>
</property> 

<property>
<!-- 
Specifies the size of the quorum. Default is 2.    
-->
<name>bklog.bookies.quorumsize</name>
<value>2</value>
</property> 

<property>
<!-- 
Specifies the logging mode. A string in {"verifiable", "generic"}. For more 
information on this see TODO    
-->
<name>bklog.bookies.qmode</name>
<value>verifiable</value>
</property> 

<property>
<!-- 
Specifies the ZooKeeper server containing info about bookies   
-->
<name>bklog.zookeeper.servers</name>
<value>127.0.0.1</value>
</property> 

Once HDFS is configured, the subsequent steps are: 
* start ZooKeeper 
* start the bookies
* initialise ZooKeeper with the bookies' host (ZOOKEEPER-301 
[https://issues.apache.org/jira/browse/ZOOKEEPER-301]  provides a util class 
for this)
* start HDFS 
* ready to GO! 

PERFORMANCE
-----------
We tested our prototype against vanilla HDFS using a benchmark provided by the 
Hadoop folks and
available in the Hadoop distribution as 
org.apache.hadoop.hdfs.NNThroughputBenchmark. BookKeeper
logging was configured to use 3 bookies with a quorum of 2 in the verifiable 
mode. File-logging was
configured to write on a single device (FS) or two devices (FS-NFS, one of them 
being mounted through NFS).

The machines were dual core 4G with 2 300G SATA drives. 

The variables involved are the number of concurrent threads and the number of 
operations to log. We tested logging 400k ops of type create with NNThroughput 
benchmark, and the throughput (measured in Ops/s) is depicted in the attached 
graph. 

> Integration with BookKeeper logging system
> ------------------------------------------
>
>                 Key: HADOOP-5189
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5189
>             Project: Hadoop Core
>          Issue Type: New Feature
>    Affects Versions: 0.19.0
>            Reporter: Luca Telloli
>
> BookKeeper is a system to reliably log streams of records 
> (https://issues.apache.org/jira/browse/ZOOKEEPER-276). The NameNode is a 
> natural target for such a system for being the metadata repository of the 
> entire file system for HDFS. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to