Hi Milind, HDFS federation can solve the NN bottle neck and memory limit problem.
AbstractNameSystem design sounds good. but distributed meta storage using HBase should bring performance degration. On Oct 4, 2013 3:18 AM, "Milind Bhandarkar" <mbhandar...@gopivotal.com> wrote: > Hi All, > > Exec Summary: For the last couple of months, we, at Pivotal, along with a > couple of folks in the community have been working on making Namespace > implementation in the namenode pluggable. We have demonstrated that it can > be done without major surgery on the namenode, and does not have noticeable > performance impact. We would like to contribute it back to Apache if there > is sufficient interest. Please let us know if you are interested, and we > will create a Jira and update the patch for in-progress work. > > > Rationale: > > In a Hadoop cluster, Namenode roughly has following main responsibilities. > • Catering to RPC calls from clients. > • Managing the HDFS namespace tree. > • Managing block report, heartbeat and other communication from data nodes. > > For Hadoop clusters having large number of files and large number of nodes, > name node gets bottlenecked. Mainly for two reasons > • All the information is kept in name node’s main memory. > • Namenode has to cater to all the request from clients / data nodes. > • And also perform some operations for backup and check pointing node. > > A possible solution is to add more main memory but there are certain issues > with this approach > • Namnenode being Java application, garbage collection cycles execute > periodically to reclaim unreferenced heap space. When the heap space grows > very large, despite of GC policy chosen, application stalls during the GC > activity. This creates a bunch of issues since DNs and clients may > perceive this stall as NN crash. > • There will always be a practical limit on how much physical memory a > single machine can accommodate. > > Proposed Solution: > > Out of the three responsibilities listed above, we can refactor namespace > management from the namenode codebase in such a way that there is provision > to implement and plug other name systems other than existing in-process > memory-based name system. Particularly a name system backed by a > distributed key-value store will significantly reduce namenode memory > requirement.To achieve this, a new generic interface will be introduced > [Let’s call it AbstractNameSystem] which defines set of operations using > which we perform the namespace management. Namenode code that used to > manipulate some java objects maintained in namenode’s heap will now operate > on this interface. There will be provision for others to extend this > interface and plug their own NameSystem implementation. > > To get started, we have implemented the same memory-based namespace > implementation in a remote process, outside of the namenode JVM. In > addition, work is undergoing to implement the namesystem using HBase. > > Details of Changes: > > Created new class called AbstractNamesystem, existing FSNamesystem is a > subclass of this class. Some code from FSNamesystem has been moved to its > parent. Created a Factory class to create object of NS management > class.Factory refers to newly added config properties to support pluggable > name space management class. Added unit tests for Factory. Replaced > constructors with factory calls, this is because the namesystem instances > should now be created based on configuration. Added new config properties > to support pluggable name space management class. This property will decide > which Namesystem class will be instantiated by the factory. This change is > also reflected in some DFS related webapps [JSP files] where namesystem > instance is used to obtain DFS health and other stats. > > These changes aim to make the namesystem pluggable without changing high > level interfaces, this is particularly tricky since memory-based name > system functionality is currently baked into these interfaces, and ultimate > goal is to make the high level interface free from memory-based name > system. > > Consideration for Upgrade and Rollback: > > Current memory based implementation already has code to read from and write > to fsimage , we will have to make them publicly accessible which will > enable us to upgrade an existing cluster from FSNamespace to newly added > name system in future version. > > a. Upgrades: By making use of existing Loader class for reading fsimage we > can write some code load this image into the future name system > implementation. > > b. Rollback: Are even simpler, we can preserve the old fsimage and start > the cluster with that image by configuring the cluster to use current file > system based name system. > > Future work > > Current HDFS design is such that FSNameSystem is baked into even high level > interfaces, this is a major hurdle in cleanly implementing pluggable name > systems. We aim to propose a change in such interfaces into which > FSNameSystem is tightly coupled. > > - Milind > > > --- > Milind Bhandarkar > Chief Scientist > Pivotal >