Re: [Proposal] Pluggable Namespace

Milind Bhandarkar Mon, 07 Oct 2013 10:19:26 -0700

Thanks Bobby.

Experimentation with new namespace implementations and parallel development
is one of the main intents of starting this project from my end.


HDFS has improved a lot, and many of the perceived limitations, such as HA,
performance, snapshots, and (limited) NFS connectivity, have been addressed in
the last two years. I think namespace scalability is the only checkbox
on that list which has not been fully checked. IMHO, allowing namespaces to
be pluggable will allow folks to address that.

And I would like to state once again that this work is orthogonal to
namenode federation, and can co-exist with it.

- Milind


---
Milind Bhandarkar
Chief Scientist
Pivotal
+1-650-523-3858 (W)
+1-408-666-8483 (C)


On Mon, Oct 7, 2013 at 8:52 AM, Bobby Evans <ev...@yahoo-inc.com> wrote:

> Putting all conspiracy theories aside :).  Any way we decide to scale the
> name node is going to have limitations.  Federation currently has the
> problem that we cannot easily move data between different name nodes.  It
> is a static partitioning. It is not a blocker, but it can be annoying.  We
> can fix this, but to do so would require some sophisticated coordination
> between the name nodes involved.  If we put the namespace in a key/value
> store like Hbase there are likely to be mapping issues between a tree
> structure and a flat structure making some use cases, like very deep
> trees, potentially a lot slower.  It also does not scale the maximum
> number of operations per second a file system can do.  Because each has
> advantages and drawbacks it is important for us to enable different use
> cases. This will allow for experimentation and parallel development and
> testing of new namespaces.  I thought this was the original vision of
> federation.  Something where /tmp and /archive both co-exist together, but
> potentially have very different implementations to optimize for different
> use cases.
>
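A minimal sketch of the tree-to-flat mapping cost mentioned above, assuming one hypothetical key scheme ("parentId/name" -> child inode id) in a plain in-process map. It does not use HBase or any HDFS API; the class and method names are illustrative only. Under this scheme, resolving a path costs one key-value get per path component, so deep trees pay proportionally more round trips.

```java
import java.util.HashMap;
import java.util.Map;

class FlatNamespaceSketch {
    // Hypothetical flat store: key is "parentId/name", value is the child's inode id.
    private final Map<String, Long> store = new HashMap<>();
    private long nextId = 1;   // inode 0 is the root
    private int lookups = 0;   // counts gets made by resolve(), standing in for remote round trips

    /** Creates every missing component of an absolute path, like mkdir -p. */
    long mkdirs(String path) {
        long parent = 0;
        for (String name : path.split("/")) {
            if (name.isEmpty()) continue;
            String key = parent + "/" + name;
            Long child = store.get(key);
            if (child == null) {
                child = nextId++;
                store.put(key, child);
            }
            parent = child;
        }
        return parent;
    }

    /** Resolves an absolute path; returns -1 if any component is missing. */
    long resolve(String path) {
        long current = 0;
        for (String name : path.split("/")) {
            if (name.isEmpty()) continue;
            lookups++;
            Long child = store.get(current + "/" + name);
            if (child == null) return -1;
            current = child;
        }
        return current;
    }

    int lookups() { return lookups; }

    public static void main(String[] args) {
        FlatNamespaceSketch ns = new FlatNamespaceSketch();
        ns.mkdirs("/a/b/c/d/e/f");
        ns.resolve("/a/b/c/d/e/f");
        System.out.println("gets for a depth-6 path: " + ns.lookups());
    }
}
```

Keying by full path instead would make a lookup a single get, but would turn a directory rename into a rewrite of every descendant key, which is the other side of the same trade-off.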
>
> Vinod,
>
> Yes block management has been separated out.  This is not about that, it
> is about providing a clean plugin point where someone can more easily take
> advantage of not just the block management code, but also the RPC and
> client code.
>
> --Bobby
>
>
>
> On 10/6/13 10:04 PM, "Mahadev Konar" <maha...@hortonworks.com> wrote:
>
> >Milind,
> > Am I missing something here? This was supposed to be a discussion and am
> > >hoping that's why you started the thread. I don't see anywhere any
> >conspiracy theory being considered or being talked about. Vinod asked
> >some questions, if you can't or do not want to respond I suggest you skip
> >emailing or ignore rather than making false assumptions and accusations.
> >I hope the intent here is to contribute code and stays that way.
> >
> >thanks
> >mahadev
> >
> >On Oct 6, 2013, at 5:58 PM, Milind Bhandarkar <mbhandar...@gopivotal.com>
> >wrote:
> >
> >> Vinod,
> >>
> >> I have received a few emails about concerns that this effort somehow
> >> conflicts with federated namenodes. Most of these emails are from folks
> >> who are directly or remotely associated with Hortonworks.
> >>
> >> Three weeks ago, I sent emails about this effort to a few  Hadoop
> >> committers who are primarily focused on HDFS, whose email address I had.
> >> While 2 out of those three responded to me, the third person associated
> >> with Hortonworks, did not.
> >>
> >> Is Hortonworks concerned that this proposal conflicts with their
> > >> development on federated namenode? I have explicitly stated that it
> > >> does not, and is orthogonal to federation. But I would like to know if there
> >> are some false assumptions being made about the intent of this
> >> development, and would like to quash any conspiracy theories right now,
> >> before they assume a life of their own.
> >>
> >> Thanks,
> >>
> >> Milind
> >>
> >>
> >> -----Original Message-----
> >> From: Vinod Kumar Vavilapalli [mailto:vino...@hortonworks.com]
> >> Sent: Sunday, October 06, 2013 12:21 PM
> >> To: hdfs-dev@hadoop.apache.org
> >> Subject: Re: [Proposal] Pluggable Namespace
> >>
> > >> In order to make federation happen, the block pool management was
> > >> already separated. Isn't that the same as this effort?
> >>
> >> Thanks,
> >> +Vinod
> >>
> >> On Oct 6, 2013, at 9:35 AM, Milind Bhandarkar wrote:
> >>
> > >>> Federation is orthogonal to Pluggable Namespaces. That is, one can
> >>> use Federation if needed, even while a distributed K-V store is used
> >>> on the backend.
> >>>
> >>> Limitations of Federated namenode for scaling namespace are
> >>> well-documented in several places, including the Giraffa presentation.
> >>>
> >>> HBase is only one of the several namespace implementations possible.
> >>> Thus, if HBase-based namespace implementation does not fit your
> >>> performance needs, you have a choice of using something else.
> >>>
> >>> - milind
> >>>
> >>> -----Original Message-----
> >>> From: Azuryy Yu [mailto:azury...@gmail.com]
> >>> Sent: Saturday, October 05, 2013 6:41 PM
> >>> To: hdfs-dev@hadoop.apache.org
> >>> Subject: Re: [Proposal] Pluggable Namespace
> >>>
> >>> Hi Milind,
> >>>
> > >>> HDFS federation can solve the NN bottleneck and memory limit problem.
> > >>>
> > >>> The AbstractNameSystem design sounds good, but distributed meta storage
> > >>> using HBase would bring performance degradation.
> >>> On Oct 4, 2013 3:18 AM, "Milind Bhandarkar"
> >>> <mbhandar...@gopivotal.com>
> >>> wrote:
> >>>
> >>>> Hi All,
> >>>>
> >>>> Exec Summary: For the last couple of months, we, at Pivotal, along
> >>>> with a couple of folks in the community have been working on making
> >>>> Namespace implementation in the namenode pluggable. We have
> >>>> demonstrated that it can be done without major surgery on the
> >>>> namenode, and does not have noticeable performance impact. We would
> >>>> like to contribute it back to Apache if there is sufficient interest.
> > >>>> Please let us know if you are interested, and we will create a Jira
> > >>>> and update the patch for in-progress work.
> >>>>
> >>>>
> >>>> Rationale:
> >>>>
> > >>>> In a Hadoop cluster, the Namenode has roughly the following main
> > >>>> responsibilities:
> > >>>> . Catering to RPC calls from clients.
> > >>>> . Managing the HDFS namespace tree.
> > >>>> . Managing block reports, heartbeats and other communication from data
> > >>>>   nodes.
> > >>>>
> > >>>> For Hadoop clusters having a large number of files and a large number of
> > >>>> nodes, the name node becomes a bottleneck, mainly for the following
> > >>>> reasons:
> > >>>> . All the information is kept in the name node's main memory.
> > >>>> . The Namenode has to cater to all the requests from clients / data nodes.
> > >>>> . It also performs some operations for the backup and checkpointing nodes.
> >>>>
> > >>>> A possible solution is to add more main memory, but there are certain
> > >>>> issues with this approach:
> > >>>> . The Namenode being a Java application, garbage collection cycles
> > >>>>   execute periodically to reclaim unreferenced heap space. When the heap
> > >>>>   grows very large, regardless of the GC policy chosen, the application
> > >>>>   stalls during GC activity. This creates a bunch of issues, since DNs
> > >>>>   and clients may perceive this stall as an NN crash.
> > >>>> . There will always be a practical limit on how much physical memory
> > >>>>   a single machine can accommodate.
> >>>>
> >>>> Proposed Solution:
> >>>>
> > >>>> Out of the three responsibilities listed above, we can refactor
> > >>>> namespace management out of the namenode codebase in such a way that
> > >>>> there is provision to implement and plug in name systems other
> > >>>> than the existing in-process memory-based name system. In particular, a
> > >>>> name system backed by a distributed key-value store will
> > >>>> significantly reduce the namenode memory requirement.
> > >>>>
> > >>>> To achieve this, a new generic interface will be introduced [let's call
> > >>>> it AbstractNameSystem] which defines the set of operations through which
> > >>>> we perform namespace management. Namenode code that used to
> > >>>> manipulate Java objects maintained in the namenode's heap will now
> > >>>> operate on this interface.
> > >>>> There will be provision for others to extend this interface and plug in
> > >>>> their own NameSystem implementation.
> >>>>
> >>>> To get started, we have implemented the same memory-based namespace
> >>>> implementation in a remote process, outside of the namenode JVM. In
> > >>>> addition, work is under way to implement the namesystem using HBase.
> >>>>
> >>>> Details of Changes:
> >>>>
> > >>>> Created a new class called AbstractNamesystem; the existing FSNamesystem
> > >>>> is a subclass of this class. Some code from FSNamesystem has been moved
> > >>>> to its parent. Created a Factory class to create objects of the NS
> > >>>> management class. Added new config properties to support a pluggable
> > >>>> name space management class; the Factory reads these properties to
> > >>>> decide which Namesystem class to instantiate. Added unit tests for the
> > >>>> Factory. Replaced constructors with factory calls, because
> > >>>> namesystem instances should now be created based on configuration.
> > >>>> This change is also reflected in some DFS-related webapps [JSP files]
> > >>>> where the namesystem instance is used to obtain DFS health and other
> > >>>> stats.
> >>>>
> > >>>> These changes aim to make the namesystem pluggable without changing
> > >>>> high level interfaces. This is particularly tricky since memory-based
> > >>>> name system functionality is currently baked into these interfaces,
> > >>>> and the ultimate goal is to make the high level interfaces free from
> > >>>> the memory-based name system.
> >>>>
> >>>> Consideration for Upgrade and Rollback:
> >>>>
> > >>>> The current memory-based implementation already has code to read from
> > >>>> and write to the fsimage. We will have to make it publicly accessible,
> > >>>> which will enable us to upgrade an existing cluster from FSNamespace
> > >>>> to a newly added name system in a future version.
> > >>>>
> > >>>> a. Upgrades: By making use of the existing Loader class for reading the
> > >>>> fsimage, we can write some code to load this image into the future name
> > >>>> system implementation.
> > >>>>
> > >>>> b. Rollbacks: These are even simpler. We can preserve the old fsimage and
> > >>>> start the cluster with that image by configuring the cluster to use
> > >>>> the current file system based name system.
> >>>>
> >>>> Future work
> >>>>
> > >>>> The current HDFS design is such that FSNameSystem is baked into even high
> > >>>> level interfaces; this is a major hurdle in cleanly implementing
> > >>>> pluggable name systems. We aim to propose a change in the interfaces
> > >>>> to which FSNameSystem is tightly coupled.
> >>>>
> >>>> - Milind
> >>>>
> >>>>
> >>>> ---
> >>>> Milind Bhandarkar
> >>>> Chief Scientist
> >>>> Pivotal
> >>>>
> >>
> >>
> >> --
> >> CONFIDENTIALITY NOTICE
> > >> NOTICE: This message is intended for the use of the individual or entity
> > >> to which it is addressed and may contain information that is
> > >> confidential, privileged and exempt from disclosure under applicable
> > >> law. If the reader of this message is not the intended recipient, you
> > >> are hereby notified that any printing, copying, dissemination,
> > >> distribution, disclosure or forwarding of this communication is strictly
> > >> prohibited. If you have received this communication in error, please
> > >> contact the sender immediately and delete it from your system. Thank You.
> >
> >
> >--
> >CONFIDENTIALITY NOTICE
> >NOTICE: This message is intended for the use of the individual or entity
> >to
> >which it is addressed and may contain information that is confidential,
> >privileged and exempt from disclosure under applicable law. If the reader
> >of this message is not the intended recipient, you are hereby notified
> >that
> >any printing, copying, dissemination, distribution, disclosure or
> >forwarding of this communication is strictly prohibited. If you have
> >received this communication in error, please contact the sender
> >immediately
> >and delete it from your system. Thank You.
>
>
