I found something in the logs on the nodes where I had a problem.  A
ContentNotFoundException begins occurring on these nodes, and after many
thousands of occurrences we eventually get "too many open files".  Once I
do surgery on the content repository so that the ContentNotFoundException
stops happening, lsof tells me that I'm no longer 'leaking' (deleted) file
descriptors.  I have no idea why the ContentNotFoundException starts
happening, but it sounds similar to Tony Kurc's "heisenbug" [1] reported
back in March.

Here's a stack trace of the ContentNotFoundException:

2016-04-27 12:00:00,00 ERROR [Timer-Driven Process Thread-1]
MyCustomProcessor
org.apache.nifi.processor.exception.MissingFlowFileException: Unable to
find content for FlowFile
  at
o.a.n.controller.repository.StandardProcessSession.handleContentNotFound(StandardProcessSession.java:2349)
  at
o.a.n.controller.repository.StandardProcessSession.write(StandardProcessSession.java:2186)
  at MyCustomProcessor.onTrigger()
  at o.a.n.processor.AbstractProcessor.onTrigger(AbstractProcessor.java:27)
Caused by: o.a.n.controller.repository.ContentNotFoundException: Could not
find content for StandardContentClaim
[resourceClaim=StandardResourceClaim[id=123456789-1, container=default,
section=1], offset=351144, length=30720]
  at
o.a.n.controller.repository.FileSystemRepository.getPath(FileSystemRepository.java:492)
  at
o.a.n.controller.repository.FileSystemRepository.read(FileSystemRepository.java:801)
  at
o.a.n.controller.repository.StandardProcessSession.getInputStream(StandardProcessSession.java:1743)
  at
o.a.n.controller.repository.StandardProcessSession.write(StandardProcessSession.java:2152)
  ... common frames omitted

MyCustomProcessor just calls session.write() with a StreamCallback
implementation that reads from an InputStream and writes to an OutputStream.
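In case it helps, here is a minimal sketch of that pattern.  The class body,
relationship, and the StreamUtils.copy() transform are illustrative
stand-ins, not the actual processor code:

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Collections;
import java.util.Set;

import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.AbstractProcessor;
import org.apache.nifi.processor.ProcessContext;
import org.apache.nifi.processor.ProcessSession;
import org.apache.nifi.processor.Relationship;
import org.apache.nifi.processor.exception.ProcessException;
import org.apache.nifi.processor.io.StreamCallback;
import org.apache.nifi.stream.io.StreamUtils;

// Illustrative sketch only -- the real MyCustomProcessor is not shown here.
public class MyCustomProcessor extends AbstractProcessor {

    static final Relationship REL_SUCCESS = new Relationship.Builder()
            .name("success")
            .description("Successfully processed FlowFiles")
            .build();

    @Override
    public Set<Relationship> getRelationships() {
        return Collections.singleton(REL_SUCCESS);
    }

    @Override
    public void onTrigger(final ProcessContext context, final ProcessSession session) throws ProcessException {
        FlowFile flowFile = session.get();
        if (flowFile == null) {
            return;
        }

        // session.write() opens the FlowFile's existing content claim for reading
        // and a new claim for writing; the read side is where the
        // ContentNotFoundException in the stack trace above is thrown.
        flowFile = session.write(flowFile, new StreamCallback() {
            @Override
            public void process(final InputStream in, final OutputStream out) throws IOException {
                StreamUtils.copy(in, out);  // placeholder for the real transform
            }
        });

        session.transfer(flowFile, REL_SUCCESS);
    }
}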

[1] -
http://apache-nifi-developer-list.39713.n7.nabble.com/heisenbug-causing-quot-lost-quot-content-claims-td7775.html

-- Mike




On Wed, Apr 27, 2016 at 12:15 PM, Joe Witt <[email protected]> wrote:

> Mike
>
> Ok that is a good data point.  In my case they are all in archive but
> I do agree that isn't super meaningful because in reality nothing
> should ever be open for writing in the archive.
>
> If you can, and have enough logging on, try searching for that first
> part of the filename in your logs.  Let's see if we can narrow this to
> a particular type of flow or something.
>
> Thanks
> Joe
>
> On Wed, Apr 27, 2016 at 12:04 PM, Michael Moser <[email protected]>
> wrote:
> > Another data point ... we had archiving turned on at first, and then most
> > (but not all) files that lsof reported were
> >
> > /content_repository/0/archive/123456789-123456 (deleted).
> >
> > We turned archiving off, hoping that was related in some way, but it was
> > not.
> >
> > -- Mike
> >
> >
> > On Wed, Apr 27, 2016 at 11:53 AM, Joe Witt <[email protected]> wrote:
> >
> >> Mike,
> >>
> >> Definitely does not sound familiar.  However, just looked up what you
> >> describe and I do see it.  In my case there are only three files but
> >> they are sitting there open for writing by the nifi process and yet
> >> have been deleted.  So I do believe there is an issue...will dig in a
> >> bit but obviously if you make more discoveries here please share.
> >>
> >> Thanks
> >> Joe
> >>
> >>
> >>
> >> On Wed, Apr 27, 2016 at 11:31 AM, Michael Moser <[email protected]>
> >> wrote:
> >> > Devs,
> >> >
> >> > We recently upgraded from NiFi 0.4.1 to 0.5.1 on a cluster.  We noticed
> >> > half of our cluster nodes getting "too many open files" errors that
> >> > require a NiFi restart, while the other half works without this problem.
> >> > Using 'lsof -p <pid>' to identify the open file descriptors at the time
> >> > of the problem, we see most of the file descriptors reference deleted
> >> > files in the content repository like this:
> >> >
> >> > java <pid> <user> <fd> ... /content_repository/81/123456789-123456 (deleted)
> >> >
> >> > A 'ls /content_repository/81/123456789-123456' confirms that the file
> >> > has been deleted.
> >> >
> >> > We are continuing our investigation into why some of our nodes have a
> >> > problem but others don't.  Has anyone else seen this?  Did anything
> >> > change between 0.4.1 and 0.5.1 related to deleting files from the
> >> > content repository?
> >> > Regards,
> >> > -- Mike
> >>
>
