I think it would be awesome to make this happen.  I am +1 for the general
direction. I am by no means an HDFS expert but I would like to see the
patches you have made.  If you could file a JIRA that points to the
modifications on github or in patches and any design work you have
explaining it (including what you have below) that would give us a place
to review it and discuss it.

--Bobby

On 10/3/13 2:17 PM, "Milind Bhandarkar" <mbhandar...@gopivotal.com> wrote:

>Hi All,
>
>Exec Summary: For the last couple of months, we, at Pivotal, along with a
>couple of folks in the community have been working on making Namespace
>implementation in the namenode pluggable. We have demonstrated that it can
>be done without major surgery on the namenode, and does not have
>noticeable
>performance impact. We would like to contribute it back to Apache if there
>is sufficient interest. Please let us know if you are interested, and we
>will create a Jira and update the patch for in-progress work.
>
>
>Rationale:
>
>In a Hadoop cluster, Namenode roughly has following main responsibilities.
>€ Catering to RPC calls from clients.
>€ Managing the HDFS namespace tree.
>€ Managing block report, heartbeat and other communication from data
>nodes.
>
>For Hadoop clusters having large number of files and large number of
>nodes,
>name node gets bottlenecked. Mainly for two reasons
>€ All the information is kept in name node¹s main memory.
>€ Namenode has to cater to all the request from clients / data nodes.
>€ And also perform some operations for backup and check pointing node.
>
>A possible solution is to add more main memory but there are certain
>issues
>with this approach
>€ Namnenode being Java application, garbage collection cycles execute
>periodically to reclaim unreferenced heap space. When the heap space grows
>very large, despite of GC policy  chosen, application stalls during the GC
>activity. This creates a bunch of issues since DNs and  clients may
>perceive this stall as NN crash.
>€ There will always be a practical limit on how much physical memory a
>single machine can accommodate.
>
>Proposed Solution:
>
>Out of the three responsibilities listed above, we can refactor namespace
>management from the namenode codebase in such a way that there is
>provision
>to implement and plug other name systems other than existing in-process
>memory-based name system. Particularly a name system backed by a
>distributed key-value store will significantly reduce namenode memory
>requirement.To achieve this, a new generic interface will be introduced
>[Let¹s call it AbstractNameSystem] which defines set of operations using
>which we perform the namespace management. Namenode code that used to
>manipulate some java objects maintained in namenode¹s heap will now
>operate
>on this interface. There will be provision for others to extend this
>interface and plug their own NameSystem implementation.
>
>To get started, we have implemented the same memory-based namespace
>implementation in a remote process, outside of the namenode JVM. In
>addition, work is undergoing to implement the namesystem using HBase.
>
>Details of Changes:
>
>Created new class called AbstractNamesystem, existing FSNamesystem is a
>subclass of this class. Some code from FSNamesystem has been moved to its
>parent. Created a Factory class to create object of NS management
>class.Factory refers to newly added config properties to support pluggable
>name space management class. Added unit tests for Factory. Replaced
>constructors with factory calls, this is  because the namesystem instances
>should now be created based on configuration. Added new config properties
>to support pluggable name space management class. This property will
>decide
>which Namesystem class will be instantiated by the factory. This change is
>also reflected in some DFS related webapps [JSP files] where namesystem
>instance is used to obtain DFS health and other stats.
>
>These changes aim to make the namesystem pluggable without changing high
>level interfaces, this is particularly tricky since memory-based name
>system functionality is currently baked into these interfaces, and
>ultimate
>goal is to make the high level interface free from memory-based name
>system.
>
>Consideration for Upgrade and Rollback:
>
>Current memory based implementation already has code to read from and
>write
>to fsimage , we will have to make them publicly accessible which will
>enable us to upgrade an existing cluster from FSNamespace to newly added
>name system in future version.
>
>a. Upgrades: By making use of existing Loader class for reading fsimage we
>can write some code load this image into the future name system
>implementation.
>
>b. Rollback: Are even simpler, we can preserve the old fsimage and start
>the cluster with that image by configuring the cluster to use current file
>system based name system.
>
>Future work
>
>Current HDFS design is such that FSNameSystem is baked into even high
>level
>interfaces, this is a major hurdle in cleanly implementing pluggable name
>systems. We aim to propose a change in such interfaces into which
>FSNameSystem is tightly coupled.
>
>- Milind
>
>
>---
>Milind Bhandarkar
>Chief Scientist
>Pivotal

Reply via email to