Thanks Bobby. Experimentation with new namespace implementations and parallel development is one of the main intents of starting this project from my end.
HDFS has improved a lot, and many of its perceived limitations, such as HA, performance, snapshots, and (limited) NFS connectivity, have been addressed in the last two years. I think namespace scalability is the only checkbox on that list which has not been fully checked. IMHO, allowing namespaces to be pluggable will allow folks to address that. And I would like to state once again that this work is orthogonal to namenode federation, and can co-exist with it.

- Milind

---
Milind Bhandarkar
Chief Scientist
Pivotal
+1-650-523-3858 (W)
+1-408-666-8483 (C)

On Mon, Oct 7, 2013 at 8:52 AM, Bobby Evans <ev...@yahoo-inc.com> wrote:

> Putting all conspiracy theories aside :). Any way we decide to scale the
> name node is going to have limitations. Federation currently has the
> problem that we cannot easily move data between different name nodes. It
> is a static partitioning. It is not a blocker, but it can be annoying. We
> can fix this, but to do so would require some sophisticated coordination
> between the name nodes involved. If we put the namespace in a key/value
> store like HBase there are likely to be mapping issues between a tree
> structure and a flat structure, making some use cases, like very deep
> trees, potentially a lot slower. It also does not scale the maximum
> number of operations per second a file system can do. Because each has
> advantages and drawbacks it is important for us to enable different use
> cases. This will allow for experimentation and parallel development and
> testing of new namespaces. I thought this was the original vision of
> federation. Something where /tmp and /archive both co-exist, but
> potentially have very different implementations to optimize for different
> use cases.
>
>
> Vinod,
>
> Yes, block management has been separated out. This is not about that, it
> is about providing a clean plugin point where someone can more easily take
> advantage of not just the block management code, but also the RPC and
> client code.
>
> --Bobby
>
> On 10/6/13 10:04 PM, "Mahadev Konar" <maha...@hortonworks.com> wrote:
>
> >Milind,
> > Am I missing something here? This was supposed to be a discussion and am
> >hoping that's why you started the thread. I don't see anywhere any
> >conspiracy theory being considered or being talked about. Vinod asked
> >some questions, if you can't or do not want to respond I suggest you skip
> >emailing or ignore rather than making false assumptions and accusations.
> >I hope the intent here is to contribute code and stays that way.
> >
> >thanks
> >mahadev
> >
> >On Oct 6, 2013, at 5:58 PM, Milind Bhandarkar <mbhandar...@gopivotal.com>
> >wrote:
> >
> >> Vinod,
> >>
> >> I have received a few emails about concerns that this effort somehow
> >> conflicts with federated namenodes. Most of these emails are from folks
> >> who are directly or remotely associated with Hortonworks.
> >>
> >> Three weeks ago, I sent emails about this effort to a few Hadoop
> >> committers who are primarily focused on HDFS, whose email addresses I
> >> had. While 2 out of those three responded to me, the third person,
> >> associated with Hortonworks, did not.
> >>
> >> Is Hortonworks concerned that this proposal conflicts with their
> >> development on federated namenode? I have explicitly stated that it
> >> does not, and is orthogonal to federation. But I would like to know if
> >> there are some false assumptions being made about the intent of this
> >> development, and would like to quash any conspiracy theories right now,
> >> before they assume a life of their own.
> >>
> >> Thanks,
> >>
> >> Milind
> >>
> >>
> >> -----Original Message-----
> >> From: Vinod Kumar Vavilapalli [mailto:vino...@hortonworks.com]
> >> Sent: Sunday, October 06, 2013 12:21 PM
> >> To: hdfs-dev@hadoop.apache.org
> >> Subject: Re: [Proposal] Pluggable Namespace
> >>
> >> In order to make federation happen, the block pool management was
> >> already separated. Isn't that the same as this effort?
> >>
> >> Thanks,
> >> +Vinod
> >>
> >> On Oct 6, 2013, at 9:35 AM, Milind Bhandarkar wrote:
> >>
> >>> Federation is orthogonal to Pluggable Namespaces. That is, one can
> >>> use Federation if needed, even while a distributed K-V store is used
> >>> on the backend.
> >>>
> >>> Limitations of Federated namenode for scaling namespace are
> >>> well-documented in several places, including the Giraffa presentation.
> >>>
> >>> HBase is only one of the several namespace implementations possible.
> >>> Thus, if an HBase-based namespace implementation does not fit your
> >>> performance needs, you have a choice of using something else.
> >>>
> >>> - milind
> >>>
> >>> -----Original Message-----
> >>> From: Azuryy Yu [mailto:azury...@gmail.com]
> >>> Sent: Saturday, October 05, 2013 6:41 PM
> >>> To: hdfs-dev@hadoop.apache.org
> >>> Subject: Re: [Proposal] Pluggable Namespace
> >>>
> >>> Hi Milind,
> >>>
> >>> HDFS federation can solve the NN bottleneck and memory limit problem.
> >>>
> >>> The AbstractNameSystem design sounds good, but distributed metadata
> >>> storage using HBase is likely to bring performance degradation.
> >>> On Oct 4, 2013 3:18 AM, "Milind Bhandarkar"
> >>> <mbhandar...@gopivotal.com> wrote:
> >>>
> >>>> Hi All,
> >>>>
> >>>> Exec Summary: For the last couple of months, we, at Pivotal, along
> >>>> with a couple of folks in the community, have been working on making
> >>>> the Namespace implementation in the namenode pluggable. We have
> >>>> demonstrated that it can be done without major surgery on the
> >>>> namenode, and does not have a noticeable performance impact. We would
> >>>> like to contribute it back to Apache if there is sufficient interest.
> >>>> Please let us know if you are interested, and we will create a Jira
> >>>> and update the patch for in-progress work.
> >>>>
> >>>>
> >>>> Rationale:
> >>>>
> >>>> In a Hadoop cluster, the Namenode roughly has the following main
> >>>> responsibilities:
> >>>> . Catering to RPC calls from clients.
> >>>> . Managing the HDFS namespace tree.
> >>>> . Managing block reports, heartbeats and other communication from
> >>>>   data nodes.
> >>>>
> >>>> For Hadoop clusters having a large number of files and a large number
> >>>> of nodes, the name node gets bottlenecked, mainly for two reasons:
> >>>> . All the information is kept in the name node's main memory.
> >>>> . The namenode has to cater to all the requests from clients and data
> >>>>   nodes, and also perform some operations for the backup and
> >>>>   checkpoint nodes.
> >>>>
> >>>> A possible solution is to add more main memory, but there are certain
> >>>> issues with this approach:
> >>>> . The namenode being a Java application, garbage collection cycles
> >>>>   execute periodically to reclaim unreferenced heap space. When the
> >>>>   heap grows very large, the application stalls during GC activity
> >>>>   regardless of the GC policy chosen. This creates a bunch of issues,
> >>>>   since DNs and clients may perceive this stall as an NN crash.
> >>>> . There will always be a practical limit on how much physical memory
> >>>>   a single machine can accommodate.
> >>>>
> >>>> Proposed Solution:
> >>>>
> >>>> Out of the three responsibilities listed above, we can refactor
> >>>> namespace management out of the namenode codebase in such a way that
> >>>> there is provision to implement and plug in name systems other than
> >>>> the existing in-process memory-based name system. In particular, a
> >>>> name system backed by a distributed key-value store will
> >>>> significantly reduce the namenode memory requirement. To achieve
> >>>> this, a new generic interface will be introduced [Let's call it
> >>>> AbstractNameSystem] which defines the set of operations with which we
> >>>> perform namespace management. Namenode code that used to manipulate
> >>>> Java objects maintained in the namenode's heap will now operate on
> >>>> this interface. There will be provision for others to extend this
> >>>> interface and plug in their own NameSystem implementation.
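The pluggable interface described in the Proposed Solution can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual patch: the NameSystem interface, its three methods, and both implementation classes are hypothetical names, and a sorted in-process map stands in for a distributed key-value store such as HBase.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

class PluggableNameSystemSketch {
    /** Hypothetical, heavily reduced stand-in for the proposed AbstractNameSystem. */
    interface NameSystem {
        boolean mkdir(String path);   // absolute path; the parent must already exist
        boolean exists(String path);
        List<String> list(String dir); // names of direct children, sorted
    }

    /** Tree of heap objects, standing in for the existing memory-based name system. */
    static final class InMemoryTreeNameSystem implements NameSystem {
        private static final class Node {
            final TreeMap<String, Node> children = new TreeMap<>();
        }
        private final Node root = new Node();

        private Node lookup(String path) {
            Node cur = root;
            for (String part : path.split("/")) {
                if (part.isEmpty()) continue;          // skip the leading "/"
                cur = cur.children.get(part);
                if (cur == null) return null;
            }
            return cur;
        }
        public boolean mkdir(String path) {
            int cut = path.lastIndexOf('/');
            if (cut < 0 || cut == path.length() - 1) return false;
            Node parent = lookup(path.substring(0, cut));
            String name = path.substring(cut + 1);
            if (parent == null || parent.children.containsKey(name)) return false;
            parent.children.put(name, new Node());
            return true;
        }
        public boolean exists(String path) { return lookup(path) != null; }
        public List<String> list(String dir) {
            Node n = lookup(dir);
            return n == null ? new ArrayList<>() : new ArrayList<>(n.children.keySet());
        }
    }

    /** Flat sorted map keyed by full path, standing in for a key-value store. */
    static final class FlatKeyValueNameSystem implements NameSystem {
        private final TreeMap<String, Boolean> store = new TreeMap<>();
        FlatKeyValueNameSystem() { store.put("/", true); }

        public boolean mkdir(String path) {
            int cut = path.lastIndexOf('/');
            if (cut < 0 || cut == path.length() - 1) return false;
            String parent = cut == 0 ? "/" : path.substring(0, cut);
            if (!store.containsKey(parent) || store.containsKey(path)) return false;
            store.put(path, true);
            return true;
        }
        public boolean exists(String path) { return store.containsKey(path); }
        public List<String> list(String dir) {
            String prefix = dir.endsWith("/") ? dir : dir + "/";
            List<String> out = new ArrayList<>();
            // Prefix scan: visits every descendant key, not just direct children.
            for (String key : store.tailMap(prefix, false).keySet()) {
                if (!key.startsWith(prefix)) break;
                String rest = key.substring(prefix.length());
                if (!rest.contains("/")) out.add(rest);
            }
            return out;
        }
    }

    /** Caller code sees only the interface, so either implementation can be plugged in. */
    static List<String> exercise(NameSystem ns) {
        ns.mkdir("/tmp");
        ns.mkdir("/archive");
        ns.mkdir("/archive/2013");
        ns.mkdir("/missing/child");   // rejected: parent does not exist
        return ns.list("/");
    }

    public static void main(String[] args) {
        System.out.println(exercise(new InMemoryTreeNameSystem()));  // [archive, tmp]
        System.out.println(exercise(new FlatKeyValueNameSystem()));  // [archive, tmp]
    }
}
```

Both implementations give the same answers through the same interface. The flat version's list() has to walk every key under the prefix, which hints at the tree-to-flat mapping cost for deep trees raised elsewhere in this thread.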
> >>>>
> >>>> To get started, we have implemented the same memory-based namespace
> >>>> implementation in a remote process, outside of the namenode JVM. In
> >>>> addition, work is under way to implement the namesystem using HBase.
> >>>>
> >>>> Details of Changes:
> >>>>
> >>>> . Created a new class called AbstractNamesystem; the existing
> >>>>   FSNamesystem is a subclass of this class. Some code from
> >>>>   FSNamesystem has been moved to its parent.
> >>>> . Created a Factory class to create the object of the namespace
> >>>>   management class. The Factory refers to newly added config
> >>>>   properties to support a pluggable namespace management class.
> >>>>   Added unit tests for the Factory.
> >>>> . Replaced constructors with factory calls, because namesystem
> >>>>   instances should now be created based on configuration.
> >>>> . Added new config properties to support a pluggable namespace
> >>>>   management class. This property decides which Namesystem class
> >>>>   is instantiated by the factory. This change is also reflected in
> >>>>   some DFS-related webapps [JSP files] where the namesystem instance
> >>>>   is used to obtain DFS health and other stats.
> >>>>
> >>>> These changes aim to make the namesystem pluggable without changing
> >>>> high level interfaces. This is particularly tricky since memory-based
> >>>> name system functionality is currently baked into these interfaces,
> >>>> and the ultimate goal is to make the high level interfaces free from
> >>>> the memory-based name system.
> >>>>
> >>>> Consideration for Upgrade and Rollback:
> >>>>
> >>>> The current memory-based implementation already has code to read from
> >>>> and write to fsimage. We will have to make that code publicly
> >>>> accessible, which will enable us to upgrade an existing cluster from
> >>>> FSNamesystem to a newly added name system in a future version.
> >>>>
> >>>> a. Upgrades: By making use of the existing Loader class for reading
> >>>>    fsimage, we can write some code to load this image into the future
> >>>>    name system implementation.
> >>>>
> >>>> b. Rollbacks: These are even simpler. We can preserve the old fsimage
> >>>>    and start the cluster with that image by configuring the cluster
> >>>>    to use the current file system based name system.
> >>>>
> >>>> Future work
> >>>>
> >>>> The current HDFS design is such that FSNamesystem is baked into even
> >>>> high level interfaces. This is a major hurdle in cleanly implementing
> >>>> pluggable name systems. We aim to propose a change to the interfaces
> >>>> to which FSNamesystem is tightly coupled.
> >>>>
> >>>> - Milind
> >>>>
> >>>>
> >>>> ---
> >>>> Milind Bhandarkar
> >>>> Chief Scientist
> >>>> Pivotal
> >>>>
> >>
> >>
> >> --
> >> CONFIDENTIALITY NOTICE
> >> NOTICE: This message is intended for the use of the individual or entity
> >> to which it is addressed and may contain information that is
> >> confidential, privileged and exempt from disclosure under applicable
> >> law. If the reader of this message is not the intended recipient, you
> >> are hereby notified that any printing, copying, dissemination,
> >> distribution, disclosure or forwarding of this communication is strictly
> >> prohibited. If you have received this communication in error, please
> >> contact the sender immediately and delete it from your system. Thank You.
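A minimal sketch of the configuration-driven factory described under "Details of Changes" in the original proposal. Everything here is an assumption made for illustration: the property name dfs.namenode.namesystem.impl is hypothetical, a plain Map stands in for Hadoop's configuration object, and the two namesystem classes are empty stand-ins rather than the real FSNamesystem or any class from the patch.

```java
import java.util.HashMap;
import java.util.Map;

class NamesystemFactorySketch {
    /** Reduced stand-in for the proposed abstract parent class. */
    static abstract class AbstractNamesystem {
        abstract String describe();
    }

    /** Stand-in for the existing memory-based implementation (not the real class). */
    static class FSNamesystem extends AbstractNamesystem {
        String describe() { return "in-memory"; }
    }

    /** Stand-in for an alternative implementation, e.g. one backed by a key-value store. */
    static class KeyValueNamesystem extends AbstractNamesystem {
        String describe() { return "key-value"; }
    }

    /** Hypothetical property name; the proposal does not name the actual property. */
    static final String IMPL_KEY = "dfs.namenode.namesystem.impl";

    /** Instantiates the configured class, defaulting to the memory-based one. */
    static AbstractNamesystem create(Map<String, String> conf) {
        String className = conf.getOrDefault(IMPL_KEY, FSNamesystem.class.getName());
        try {
            Class<? extends AbstractNamesystem> impl =
                Class.forName(className).asSubclass(AbstractNamesystem.class);
            return impl.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException | ClassCastException e) {
            throw new IllegalArgumentException("Cannot load namesystem: " + className, e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        System.out.println(create(conf).describe());   // in-memory (property unset)

        conf.put(IMPL_KEY, KeyValueNamesystem.class.getName());
        System.out.println(create(conf).describe());   // key-value
    }
}
```

Because callers go through create() instead of a constructor, switching implementations is a configuration change. Leaving the property unset falls back to the memory-based class, which is the kind of configuration-only switch the rollback path in the proposal relies on.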