I found something in the logs on the nodes where I had a problem. A ContentNotFoundException begins occurring on these nodes, and after many thousands of occurrences we eventually get "too many open files". Once I do surgery on the content repository so that the ContentNotFoundException stops happening, lsof tells me that I'm no longer 'leaking' (deleted) file descriptors. I have no idea why the ContentNotFoundException starts happening, but it sounds similar to Tony Kurc's "heisenbug" [1] reported back in March.
Here's a stack trace of the ContentNotFoundException:

2016-04-27 12:00:00,00 ERROR [Timer-Driven Process Thread-1] MyCustomProcessor
org.apache.nifi.processor.exception.MissingFlowFileException: Unable to find content for FlowFile
        at o.a.n.controller.repository.StandardProcessSession.handleContentNotFound(StandardProcessSession.java:2349)
        at o.a.n.controller.repository.StandardProcessSession.write(StandardProcessSession.java:2186)
        at MyCustomProcessor.onTrigger()
        at o.a.n.processor.AbstractProcessor.onTrigger(AbstractProcessor.java:27)
Caused by: o.a.n.controller.repository.ContentNotFoundException: Could not find content for StandardContentClaim [resourceClaim=StandardResourceClaim[id=123456789-1, container=default, section=1], offset=351144, length=30720]
        at o.a.n.controller.repository.FileSystemRepository.getPath(FileSystemRepository.java:492)
        at o.a.n.controller.repository.FileSystemRepository.read(FileSystemRepository.java:801)
        at o.a.n.controller.repository.StandardProcessSession.getInputStream(StandardProcessSession.java:1743)
        at o.a.n.controller.repository.StandardProcessSession.write(StandardProcessSession.java:2152)
        ... common frames omitted

MyCustomProcessor just calls session.write() with a StreamCallback implementation that reads from an InputStream and writes to an OutputStream (a minimal sketch of that pattern is at the end of this message, below the quoted thread).

[1] - http://apache-nifi-developer-list.39713.n7.nabble.com/heisenbug-causing-quot-lost-quot-content-claims-td7775.html

-- Mike

On Wed, Apr 27, 2016 at 12:15 PM, Joe Witt <[email protected]> wrote:
> Mike
>
> Ok that is a good data point.  In my case they are all in archive but
> I do agree that isn't super meaningful because in reality nothing
> should ever be open for writing in the archive.
>
> If you can and have enough logging on try searching for that first
> part of the filename in your logs.  Let's see if we can narrow this to
> a particular type of flow or something.
>
> Thanks
> Joe
>
> On Wed, Apr 27, 2016 at 12:04 PM, Michael Moser <[email protected]> wrote:
> > Another data point ... we had archiving turned on at first, and then most
> > (but not all) files that lsof reported were
> >
> > /content_repository/0/archive/123456789-123456 (deleted).
> >
> > We turned archiving off, hoping that was related in some way, but it was
> > not.
> >
> > -- Mike
> >
> >
> > On Wed, Apr 27, 2016 at 11:53 AM, Joe Witt <[email protected]> wrote:
> >
> >> Mike,
> >>
> >> Definitely does not sound familiar.  However, just looked up what you
> >> describe and I do see it.  In my case there are only three files but
> >> they are sitting there open for writing by the nifi process and yet
> >> have been deleted.  So I do believe there is an issue...will dig in a
> >> bit but obviously if you make more discoveries here please share.
> >>
> >> Thanks
> >> Joe
> >>
> >>
> >>
> >> On Wed, Apr 27, 2016 at 11:31 AM, Michael Moser <[email protected]> wrote:
> >> > Devs,
> >> >
> >> > We recently upgraded from NiFi 0.4.1 to 0.5.1 on a cluster.  We noticed
> >> > half of our cluster nodes getting "too many open files" errors that require
> >> > a NiFi restart, while the other half works without this problem.  Using
> >> > 'lsof -p <pid>' to identify the open file descriptors at the time of the
> >> > problem, we see most of the file descriptors reference deleted files in the
> >> > content repository like this:
> >> >
> >> > java <pid> <user> <fd> ... /content_repository/81/123456789-123456 (deleted)
> >> >
> >> > A 'ls /content_repository/81/123456789-123456' confirms that the file has
> >> > been deleted.
> >> >
> >> > We are continuing our investigation into why some of our nodes have a
> >> > problem but others don't.  Has anyone else seen this?  Did anything change
> >> > between 0.4.1 and 0.5.1 related to deleting files from the content
> >> > repository?
> >> >
> >> > Regards,
> >> > -- Mike
> >> >
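
P.S. For anyone reading along, here is a rough sketch of the session.write() / StreamCallback pattern described above. It is only a stand-in, not the real processor: the class name, the "success" relationship, and the straight byte copy are placeholders, since the actual transform in MyCustomProcessor isn't shown in this thread.

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Collections;
import java.util.Set;

import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.AbstractProcessor;
import org.apache.nifi.processor.ProcessContext;
import org.apache.nifi.processor.ProcessSession;
import org.apache.nifi.processor.Relationship;
import org.apache.nifi.processor.exception.ProcessException;
import org.apache.nifi.processor.io.StreamCallback;
import org.apache.nifi.stream.io.StreamUtils;

public class MyCustomProcessor extends AbstractProcessor {

    // Placeholder relationship; the real processor's relationships aren't shown in this thread.
    static final Relationship REL_SUCCESS = new Relationship.Builder()
            .name("success")
            .build();

    @Override
    public Set<Relationship> getRelationships() {
        return Collections.singleton(REL_SUCCESS);
    }

    @Override
    public void onTrigger(final ProcessContext context, final ProcessSession session) throws ProcessException {
        FlowFile flowFile = session.get();
        if (flowFile == null) {
            return;
        }

        // session.write() resolves the FlowFile's existing content claim before the callback
        // runs; judging from the stack trace above, that is where the ContentNotFoundException
        // is thrown (write -> getInputStream -> FileSystemRepository.read/getPath).
        flowFile = session.write(flowFile, new StreamCallback() {
            @Override
            public void process(final InputStream in, final OutputStream out) throws IOException {
                // The real processor transforms the bytes; a straight copy stands in here.
                StreamUtils.copy(in, out);
            }
        });

        session.transfer(flowFile, REL_SUCCESS);
    }
}

As far as I can tell, nothing in that pattern holds a file descriptor itself; the content claim files that lsof reports open are opened and closed by the framework's FileSystemRepository, which is why the leak looks like a framework issue rather than something in the processor.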
