That's great, Mark. Thanks for looking into it. No, I haven't been able to trace through the logs to identify the cause of the ContentNotFoundException yet. If I find anything, I will share it.
-- Mike

On Thu, Apr 28, 2016 at 11:16 AM, Mark Payne <marka...@hotmail.com> wrote:

> Mike,
>
> I am looking into this, and I was able to figure out how we could
> potentially archive (and eventually age off from the archive) a piece of
> data for which there is still an open file handle. Specifically, I am able
> to understand how this could happen only when we have a problem reading a
> FlowFile (such as a ContentNotFoundException) during a call to
> ProcessSession.write() with a StreamCallback.
>
> So exactly the situation that you laid out here. I will be looking into
> how to address this.
>
> Do you have any idea what may have caused your ContentNotFoundException
> to occur in the first place?
>
> Thanks
> -Mark
>
>
> > On Apr 27, 2016, at 5:24 PM, Michael Moser <moser...@gmail.com> wrote:
> >
> > I found something in the logs on the nodes where I had a problem. A
> > ContentNotFoundException begins occurring on these nodes, and after
> > many thousands of occurrences we eventually get "too many open files".
> > Once I do surgery on the content repository so that the
> > ContentNotFoundException stops happening, lsof tells me that I am no
> > longer 'leaking' (deleted) file descriptors. I have no idea why the
> > ContentNotFoundException starts happening, but it sounds similar to
> > Tony Kurc's "heisenbug" [1] reported back in March.
> >
> > Here's a stack trace of the ContentNotFoundException:
> >
> > 2016-04-27 12:00:00,00 ERROR [Timer-Driven Process Thread-1]
> > MyCustomProcessor
> > org.apache.nifi.processor.exception.MissingFlowFileException: Unable to
> > find content for FlowFile
> >     at o.a.n.controller.repository.StandardProcessSession.handleContentNotFound(StandardProcessSession.java:2349)
> >     at o.a.n.controller.repository.StandardProcessSession.write(StandardProcessSession.java:2186)
> >     at MyCustomProcessor.onTrigger()
> >     at o.a.n.processor.AbstractProcessor.onTrigger(AbstractProcessor.java:27)
> > Caused by: o.a.n.controller.repository.ContentNotFoundException: Could
> > not find content for StandardContentClaim
> > [resourceClaim=StandardResourceClaim[id=123456789-1, container=default,
> > section=1], offset=351144, length=30720]
> >     at o.a.n.controller.repository.FileSystemRepository.getPath(FileSystemRepository.java:492)
> >     at o.a.n.controller.repository.FileSystemRepository.read(FileSystemRepository.java:801)
> >     at o.a.n.controller.repository.StandardProcessSession.getInputStream(StandardProcessSession.java:1743)
> >     at o.a.n.controller.repository.StandardProcessSession.write(StandardProcessSession.java:2152)
> >     ... common frames omitted
> >
> > MyCustomProcessor just calls session.write() with a StreamCallback
> > implementation that reads from an InputStream and writes to an
> > OutputStream.
> >
> > [1] - http://apache-nifi-developer-list.39713.n7.nabble.com/heisenbug-causing-quot-lost-quot-content-claims-td7775.html
> >
> > -- Mike
> >
> >
> > On Wed, Apr 27, 2016 at 12:15 PM, Joe Witt <joe.w...@gmail.com> wrote:
> >
> >> Mike
> >>
> >> Ok, that is a good data point. In my case they are all in the archive,
> >> but I do agree that isn't super meaningful because in reality nothing
> >> should ever be open for writing in the archive.
> >>
> >> If you can, and have enough logging on, try searching for that first
> >> part of the filename in your logs. Let's see if we can narrow this to
> >> a particular type of flow or something.
> >>
> >> Thanks
> >> Joe
> >>
> >> On Wed, Apr 27, 2016 at 12:04 PM, Michael Moser <moser...@gmail.com>
> >> wrote:
> >>
> >>> Another data point ... we had archiving turned on at first, and then
> >>> most (but not all) of the files that lsof reported were
> >>>
> >>> /content_repository/0/archive/123456789-123456 (deleted).
> >>>
> >>> We turned archiving off, hoping that it was related in some way, but
> >>> it was not.
> >>>
> >>> -- Mike
> >>>
> >>>
> >>> On Wed, Apr 27, 2016 at 11:53 AM, Joe Witt <joe.w...@gmail.com> wrote:
> >>>
> >>>> Mike,
> >>>>
> >>>> Definitely does not sound familiar. However, I just looked up what
> >>>> you describe and I do see it. In my case there are only three files,
> >>>> but they are sitting there open for writing by the nifi process and
> >>>> yet have been deleted. So I do believe there is an issue ... will
> >>>> dig in a bit, but obviously if you make more discoveries here please
> >>>> share.
> >>>>
> >>>> Thanks
> >>>> Joe
> >>>>
> >>>>
> >>>> On Wed, Apr 27, 2016 at 11:31 AM, Michael Moser <moser...@gmail.com>
> >>>> wrote:
> >>>>
> >>>>> Devs,
> >>>>>
> >>>>> We recently upgraded from NiFi 0.4.1 to 0.5.1 on a cluster. We
> >>>>> noticed half of our cluster nodes getting "too many open files"
> >>>>> errors that require a NiFi restart, while the other half works
> >>>>> without this problem. Using 'lsof -p <pid>' to identify the open
> >>>>> file descriptors at the time of the problem, we see that most of
> >>>>> the file descriptors reference deleted files in the content
> >>>>> repository, like this:
> >>>>>
> >>>>> java <pid> <user> <fd> ... /content_repository/81/123456789-123456 (deleted)
> >>>>>
> >>>>> An 'ls /content_repository/81/123456789-123456' confirms that the
> >>>>> file has been deleted.
> >>>>>
> >>>>> We are continuing our investigation into why some of our nodes have
> >>>>> a problem but others don't. Has anyone else seen this? Did anything
> >>>>> change between 0.4.1 and 0.5.1 related to deleting files from the
> >>>>> content repository?
> >>>>>
> >>>>> Regards,
> >>>>> -- Mike
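The failure mode Mark describes in his reply can be sketched in a few lines: if the stream for the new output claim is opened before the input claim is resolved, and resolving the input then throws, a write() that lacks a finally block leaks one descriptor per failure. This is a hypothetical simplification for illustration only; `LeakSketch`, `openInputClaim`, and `writeLeaksOnFailure` are invented names, not NiFi's actual StandardProcessSession code.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class LeakSketch {

    // Stand-in for resolving a FlowFile's content claim; always fails here,
    // like the ContentNotFoundException in the stack trace above.
    static InputStream openInputClaim() throws IOException {
        throw new FileNotFoundException("could not find content claim");
    }

    // Returns true if the output stream was left open on the failure path.
    static boolean writeLeaksOnFailure(Path claimFile) throws IOException {
        OutputStream out = Files.newOutputStream(claimFile); // fd opened for the new claim
        try {
            InputStream in = openInputClaim();               // throws before the callback runs
            // callback.process(in, out) would run here on the happy path
            in.close();
            out.close();
            return false;
        } catch (IOException e) {
            // BUG (deliberate, in this sketch): no finally { out.close(); }.
            // Each failed write leaves one fd open; after thousands of
            // ContentNotFoundExceptions, "too many open files" follows.
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        Path claim = Files.createTempFile("claim-", ".bin");
        System.out.println(writeLeaksOnFailure(claim) ? "fd leaked" : "fd closed");
        Files.deleteIfExists(claim);
    }
}
```

The fix implied by Mark's message is the usual one for this shape of bug: close the output stream in a finally block (or open it inside try-with-resources) so the descriptor is released even when reading the input claim fails.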
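Mike's lsof observation can be reproduced in miniature: on Linux, deleting a file that a process still holds open leaves a descriptor that `/proc/<pid>/fd` (and therefore lsof) reports with a "(deleted)" suffix, and the disk space is not reclaimed until the descriptor is closed. This Linux-only demo uses a throwaway temp file, not a real content-repository path; `DeletedFdDemo` and `countDeletedFds` are invented names for illustration.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class DeletedFdDemo {

    // Count this process's descriptors whose symlink target ends with
    // "(deleted)" -- the same marker lsof prints for deleted-but-open files.
    static int countDeletedFds() throws IOException {
        int count = 0;
        try (DirectoryStream<Path> fds = Files.newDirectoryStream(Paths.get("/proc/self/fd"))) {
            for (Path fd : fds) {
                try {
                    if (Files.readSymbolicLink(fd).toString().endsWith("(deleted)")) {
                        count++;
                    }
                } catch (IOException ignored) {
                    // fd may have been closed between listing and readlink
                }
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("content-claim-", ".bin");
        OutputStream out = new FileOutputStream(tmp.toFile()); // open for writing
        Files.delete(tmp);   // unlink while the descriptor is still open

        System.out.println("deleted-but-open fds: " + countDeletedFds());

        out.close();         // only now is the inode actually freed
    }
}
```

This is also why doing "surgery" on the repository stops the growth but does not release descriptors already counted by lsof: only closing them (or restarting NiFi) returns the inodes to the filesystem.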