I think it would be awesome to make this happen. I am +1 for the general direction. I am by no means an HDFS expert but I would like to see the patches you have made. If you could file a JIRA that points to the modifications on github or in patches and any design work you have explaining it (including what you have below) that would give us a place to review it and discuss it.
--Bobby On 10/3/13 2:17 PM, "Milind Bhandarkar" <mbhandar...@gopivotal.com> wrote: >Hi All, > >Exec Summary: For the last couple of months, we, at Pivotal, along with a >couple of folks in the community have been working on making Namespace >implementation in the namenode pluggable. We have demonstrated that it can >be done without major surgery on the namenode, and does not have >noticeable >performance impact. We would like to contribute it back to Apache if there >is sufficient interest. Please let us know if you are interested, and we >will create a Jira and update the patch for in-progress work. > > >Rationale: > >In a Hadoop cluster, Namenode roughly has following main responsibilities. >€ Catering to RPC calls from clients. >€ Managing the HDFS namespace tree. >€ Managing block report, heartbeat and other communication from data >nodes. > >For Hadoop clusters having large number of files and large number of >nodes, >name node gets bottlenecked. Mainly for two reasons >€ All the information is kept in name node¹s main memory. >€ Namenode has to cater to all the request from clients / data nodes. >€ And also perform some operations for backup and check pointing node. > >A possible solution is to add more main memory but there are certain >issues >with this approach >€ Namnenode being Java application, garbage collection cycles execute >periodically to reclaim unreferenced heap space. When the heap space grows >very large, despite of GC policy chosen, application stalls during the GC >activity. This creates a bunch of issues since DNs and clients may >perceive this stall as NN crash. >€ There will always be a practical limit on how much physical memory a >single machine can accommodate. > >Proposed Solution: > >Out of the three responsibilities listed above, we can refactor namespace >management from the namenode codebase in such a way that there is >provision >to implement and plug other name systems other than existing in-process >memory-based name system. Particularly a name system backed by a >distributed key-value store will significantly reduce namenode memory >requirement.To achieve this, a new generic interface will be introduced >[Let¹s call it AbstractNameSystem] which defines set of operations using >which we perform the namespace management. Namenode code that used to >manipulate some java objects maintained in namenode¹s heap will now >operate >on this interface. There will be provision for others to extend this >interface and plug their own NameSystem implementation. > >To get started, we have implemented the same memory-based namespace >implementation in a remote process, outside of the namenode JVM. In >addition, work is undergoing to implement the namesystem using HBase. > >Details of Changes: > >Created new class called AbstractNamesystem, existing FSNamesystem is a >subclass of this class. Some code from FSNamesystem has been moved to its >parent. Created a Factory class to create object of NS management >class.Factory refers to newly added config properties to support pluggable >name space management class. Added unit tests for Factory. Replaced >constructors with factory calls, this is because the namesystem instances >should now be created based on configuration. Added new config properties >to support pluggable name space management class. This property will >decide >which Namesystem class will be instantiated by the factory. This change is >also reflected in some DFS related webapps [JSP files] where namesystem >instance is used to obtain DFS health and other stats. > >These changes aim to make the namesystem pluggable without changing high >level interfaces, this is particularly tricky since memory-based name >system functionality is currently baked into these interfaces, and >ultimate >goal is to make the high level interface free from memory-based name >system. > >Consideration for Upgrade and Rollback: > >Current memory based implementation already has code to read from and >write >to fsimage , we will have to make them publicly accessible which will >enable us to upgrade an existing cluster from FSNamespace to newly added >name system in future version. > >a. Upgrades: By making use of existing Loader class for reading fsimage we >can write some code load this image into the future name system >implementation. > >b. Rollback: Are even simpler, we can preserve the old fsimage and start >the cluster with that image by configuring the cluster to use current file >system based name system. > >Future work > >Current HDFS design is such that FSNameSystem is baked into even high >level >interfaces, this is a major hurdle in cleanly implementing pluggable name >systems. We aim to propose a change in such interfaces into which >FSNameSystem is tightly coupled. > >- Milind > > >--- >Milind Bhandarkar >Chief Scientist >Pivotal