On Sat, Jul 4, 2009 at 9:08 AM, David B. Ritch <[email protected]> wrote:
> Thanks, Todd. Perhaps I was misinformed, or misunderstood. I'll make
> sure I close files occasionally, but it's good to know that the only
> real issue is with data recovery after losing a node.
>

Just to be clear, there aren't issues with data recovery of already-written
files. The issue is that, when you open a new file for writing, Hadoop sets
up a pipeline that looks something like:

Writer -> DN A -> DN B -> DN C

where DN A, B, and C are datanodes in your HDFS cluster. If Writer is also a
node in your HDFS cluster, Hadoop will attempt to make DN A the same machine
as Writer. If DN B fails, the write pipeline will reorganize itself to:

Writer -> DN A -> DN C

In theory I *believe* it's supposed to pick up a new datanode at this point
and tack it onto the end, but I'm not certain this is implemented quite yet.
Maybe Dhruba or someone else with more knowledge here can chime in.

So, in the absence of that feature, if all three datanodes fail before you
have closed the output stream, then your write will fail. This triple-failure
scenario is reasonably unlikely, but if you plan on trickling data into a
cluster over the course of weeks on a single output stream, it stands a
decent chance of happening, so it's worth closing and re-opening (rolling)
the file every so often. (A rough sketch of that pattern follows below the
quoted thread.)

-Todd

>
> David
>
> On 7/3/2009 3:08 PM, Todd Lipcon wrote:
> > Hi David,
> >
> > I'm unaware of any issue that would cause memory leaks when a file is
> > open for read for a long time.
> >
> > There are some issues currently with write pipeline recovery when a file
> > is open for writing for a long time and the datanodes to which it's
> > writing fail. So, I would not recommend having a file open for write for
> > longer than several hours (depending on the frequency with which you
> > expect failures).
> >
> > -Todd
> >
> > On Fri, Jul 3, 2009 at 11:20 AM, David B. Ritch <[email protected]> wrote:
> >
> >> I have been told that it is not a good idea to keep HDFS files open for
> >> a long time. The reason sounded like a memory leak in the name node -
> >> that over time, the resources absorbed by an open file will increase.
> >>
> >> Is this still an issue with Hadoop 0.19.x and 0.20.x? Was it ever an
> >> issue?
> >>
> >> I have an application that keeps a number of files open, and executes
> >> pseudo-random reads from them in response to externally generated
> >> queries. Should it close and re-open active files that are open longer
> >> than a certain amount of time? If so, how long is too long to keep a
> >> file open? And why?
> >>
> >> Thanks!
> >>
> >> David
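As a concrete illustration of the "close files occasionally" advice above, here is a
minimal sketch of a writer that rolls its HDFS output to a new file on a timer, using
Hadoop's public FileSystem API. The class name, path layout, and four-hour roll
interval are illustrative choices, not anything prescribed in the thread.

    // Sketch only: periodically close the current output stream and open a
    // new file so no single write pipeline stays open for weeks at a time.
    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class RollingHdfsWriter {
        // Illustrative roll interval, roughly in line with "no longer than
        // several hours" from the thread.
        private static final long ROLL_INTERVAL_MS = 4L * 60 * 60 * 1000;

        private final FileSystem fs;
        private final Path dir;
        private FSDataOutputStream out;
        private long openedAt;
        private int part = 0;

        public RollingHdfsWriter(Configuration conf, Path dir) throws IOException {
            this.fs = FileSystem.get(conf);
            this.dir = dir;
            roll();
        }

        public synchronized void write(byte[] record) throws IOException {
            // Close the current file and start a new one once it has been
            // open "too long", instead of trickling into one stream forever.
            if (System.currentTimeMillis() - openedAt > ROLL_INTERVAL_MS) {
                roll();
            }
            out.write(record);
        }

        private void roll() throws IOException {
            if (out != null) {
                out.close();
            }
            out = fs.create(new Path(dir, "part-" + (part++)));
            openedAt = System.currentTimeMillis();
        }

        public synchronized void close() throws IOException {
            out.close();
        }
    }

Closing each part file lets HDFS finalize and replicate its blocks, so everything
written before a roll stays readable even if the writer or its pipeline fails later,
which is consistent with Todd's point that already-written (closed) files are not at
risk.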
