That's great, Mark. Thanks for looking into it. No, I haven't been able to trace through the logs to identify the cause of the ContentNotFoundException yet. If I find anything, I will share it.
-- Mike

On Thu, Apr 28, 2016 at 11:16 AM, Mark Payne <marka...@hotmail.com> wrote:

> Mike,
>
> I am looking into this, and I was able to figure out how we could
> potentially archive (and eventually age off from the archive) a piece of
> data for which there is still an open file handle. Specifically, I am able
> to understand how this could happen only when we have a problem reading a
> FlowFile (such as a ContentNotFoundException) during a call to
> ProcessSession.write() with a StreamCallback.
>
> So exactly the situation that you laid out here. I will be looking into
> how to address this.
>
> Do you have any idea what may have caused your ContentNotFoundException
> to occur in the first place?
>
> Thanks
> -Mark
>
>
> > On Apr 27, 2016, at 5:24 PM, Michael Moser <moser...@gmail.com> wrote:
> >
> > I found something in the logs on the nodes where I had a problem. A
> > ContentNotFoundException begins occurring on these nodes, and after
> > many thousands of occurrences we eventually get "too many open files".
> > Once I do surgery on the content repository so that the
> > ContentNotFoundException stops happening, lsof tells me that I am no
> > longer 'leaking' (deleted) file descriptors. I have no idea why the
> > ContentNotFoundException starts happening, but it sounds similar to
> > Tony Kurc's "heisenbug" [1] reported back in March.
> >
> > Here's a stack trace of the ContentNotFoundException:
> >
> > 2016-04-27 12:00:00,00 ERROR [Timer-Driven Process Thread-1]
> > MyCustomProcessor
> > org.apache.nifi.processor.exception.MissingFlowFileException: Unable to
> > find content for FlowFile
> >     at o.a.n.controller.repository.StandardProcessSession.handleContentNotFound(StandardProcessSession.java:2349)
> >     at o.a.n.controller.repository.StandardProcessSession.write(StandardProcessSession.java:2186)
> >     at MyCustomProcessor.onTrigger()
> >     at o.a.n.processor.AbstractProcessor.onTrigger(AbstractProcessor.java:27)
> > Caused by: o.a.n.controller.repository.ContentNotFoundException: Could
> > not find content for StandardContentClaim
> > [resourceClaim=StandardResourceClaim[id=123456789-1, container=default,
> > section=1], offset=351144, length=30720]
> >     at o.a.n.controller.repository.FileSystemRepository.getPath(FileSystemRepository.java:492)
> >     at o.a.n.controller.repository.FileSystemRepository.read(FileSystemRepository.java:801)
> >     at o.a.n.controller.repository.StandardProcessSession.getInputStream(StandardProcessSession.java:1743)
> >     at o.a.n.controller.repository.StandardProcessSession.write(StandardProcessSession.java:2152)
> >     ... common frames omitted
> >
> > MyCustomProcessor just calls session.write() with a StreamCallback
> > implementation that reads from an InputStream and writes to an
> > OutputStream.
> >
> > [1] - http://apache-nifi-developer-list.39713.n7.nabble.com/heisenbug-causing-quot-lost-quot-content-claims-td7775.html
> >
> > -- Mike
> >
> >
> > On Wed, Apr 27, 2016 at 12:15 PM, Joe Witt <joe.w...@gmail.com> wrote:
> >
> >> Mike
> >>
> >> Ok, that is a good data point. In my case they are all in the archive,
> >> but I do agree that isn't super meaningful because in reality nothing
> >> should ever be open for writing in the archive.
> >>
> >> If you can, and have enough logging on, try searching for that first
> >> part of the filename in your logs. Let's see if we can narrow this to
> >> a particular type of flow or something.
> >>
> >> Thanks
> >> Joe
> >>
> >> On Wed, Apr 27, 2016 at 12:04 PM, Michael Moser <moser...@gmail.com>
> >> wrote:
> >>
> >>> Another data point ... we had archiving turned on at first, and then
> >>> most (but not all) of the files that lsof reported were
> >>>
> >>> /content_repository/0/archive/123456789-123456 (deleted).
> >>>
> >>> We turned archiving off, hoping that it was related in some way, but
> >>> it was not.
> >>>
> >>> -- Mike
> >>>
> >>>
> >>> On Wed, Apr 27, 2016 at 11:53 AM, Joe Witt <joe.w...@gmail.com> wrote:
> >>>
> >>>> Mike,
> >>>>
> >>>> Definitely does not sound familiar. However, I just looked up what
> >>>> you describe and I do see it. In my case there are only three files,
> >>>> but they are sitting there open for writing by the nifi process and
> >>>> yet have been deleted. So I do believe there is an issue ... will
> >>>> dig in a bit, but obviously if you make more discoveries here please
> >>>> share.
> >>>>
> >>>> Thanks
> >>>> Joe
> >>>>
> >>>>
> >>>> On Wed, Apr 27, 2016 at 11:31 AM, Michael Moser <moser...@gmail.com>
> >>>> wrote:
> >>>>
> >>>>> Devs,
> >>>>>
> >>>>> We recently upgraded from NiFi 0.4.1 to 0.5.1 on a cluster. We
> >>>>> noticed half of our cluster nodes getting "too many open files"
> >>>>> errors that require a NiFi restart, while the other half works
> >>>>> without this problem. Using 'lsof -p <pid>' to identify the open
> >>>>> file descriptors at the time of the problem, we see that most of
> >>>>> the file descriptors reference deleted files in the content
> >>>>> repository, like this:
> >>>>>
> >>>>> java <pid> <user> <fd> ... /content_repository/81/123456789-123456 (deleted)
> >>>>>
> >>>>> An 'ls /content_repository/81/123456789-123456' confirms that the
> >>>>> file has been deleted.
> >>>>>
> >>>>> We are continuing our investigation into why some of our nodes have
> >>>>> a problem but others don't. Has anyone else seen this? Did anything
> >>>>> change between 0.4.1 and 0.5.1 related to deleting files from the
> >>>>> content repository?
> >>>>>
> >>>>> Regards,
> >>>>> -- Mike
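The failure mode Mark describes in his reply can be sketched in a few lines: if the stream for the new output claim is opened before the input claim is resolved, and resolving the input then throws, a write() that lacks a finally block leaks one descriptor per failure. This is a hypothetical simplification for illustration only; `LeakSketch`, `openInputClaim`, and `writeLeaksOnFailure` are invented names, not NiFi's actual StandardProcessSession code.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class LeakSketch {

    // Stand-in for resolving a FlowFile's content claim; always fails here,
    // like the ContentNotFoundException in the stack trace above.
    static InputStream openInputClaim() throws IOException {
        throw new FileNotFoundException("could not find content claim");
    }

    // Returns true if the output stream was left open on the failure path.
    static boolean writeLeaksOnFailure(Path claimFile) throws IOException {
        OutputStream out = Files.newOutputStream(claimFile); // fd opened for the new claim
        try {
            InputStream in = openInputClaim();               // throws before the callback runs
            // callback.process(in, out) would run here on the happy path
            in.close();
            out.close();
            return false;
        } catch (IOException e) {
            // BUG (deliberate, in this sketch): no finally { out.close(); }.
            // Each failed write leaves one fd open; after thousands of
            // ContentNotFoundExceptions, "too many open files" follows.
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        Path claim = Files.createTempFile("claim-", ".bin");
        System.out.println(writeLeaksOnFailure(claim) ? "fd leaked" : "fd closed");
        Files.deleteIfExists(claim);
    }
}
```

The fix implied by Mark's message is the usual one for this shape of bug: close the output stream in a finally block (or open it inside try-with-resources) so the descriptor is released even when reading the input claim fails.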
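Mike's lsof observation can be reproduced in miniature: on Linux, deleting a file that a process still holds open leaves a descriptor that `/proc/<pid>/fd` (and therefore lsof) reports with a "(deleted)" suffix, and the disk space is not reclaimed until the descriptor is closed. This Linux-only demo uses a throwaway temp file, not a real content-repository path; `DeletedFdDemo` and `countDeletedFds` are invented names for illustration.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class DeletedFdDemo {

    // Count this process's descriptors whose symlink target ends with
    // "(deleted)" -- the same marker lsof prints for deleted-but-open files.
    static int countDeletedFds() throws IOException {
        int count = 0;
        try (DirectoryStream<Path> fds = Files.newDirectoryStream(Paths.get("/proc/self/fd"))) {
            for (Path fd : fds) {
                try {
                    if (Files.readSymbolicLink(fd).toString().endsWith("(deleted)")) {
                        count++;
                    }
                } catch (IOException ignored) {
                    // fd may have been closed between listing and readlink
                }
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("content-claim-", ".bin");
        OutputStream out = new FileOutputStream(tmp.toFile()); // open for writing
        Files.delete(tmp);   // unlink while the descriptor is still open

        System.out.println("deleted-but-open fds: " + countDeletedFds());

        out.close();         // only now is the inode actually freed
    }
}
```

This is also why doing "surgery" on the repository stops the growth but does not release descriptors already counted by lsof: only closing them (or restarting NiFi) returns the inodes to the filesystem.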