Re: NiFi 0.5.1 "too many open files"
Mike,

I am looking into this, and I was able to figure out how we could potentially archive (and eventually age off from the archive) a piece of data for which there is still an open file handle. Specifically, I am able to understand how this could happen only when we have a problem reading a FlowFile (such as a ContentNotFoundException) during a call to ProcessSession.write() with a StreamCallback. So exactly the situation that you laid out here. I will be looking into how to address this.

Do you have any idea what may have caused your ContentNotFoundException to occur in the first place?

Thanks
-Mark

> On Apr 27, 2016, at 5:24 PM, Michael Moser wrote:
> [snip]
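[For anyone following along, the failure mode Mark describes can be pictured with a minimal, hypothetical Java sketch. This is illustrative only, not NiFi's actual repository code: if the output handle for the new content claim is opened before the read side is validated, an exception from the read (the ContentNotFoundException case) can propagate out before the output handle is closed. The file can then be archived and unlinked while the descriptor is still open, which is exactly what lsof shows as "(deleted)".]

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Illustrative only -- a hypothetical stand-in for a repository write path,
// NOT NiFi's actual implementation.
public class LeakSketch {

    // Leaky variant: if opening the input throws (the ContentNotFoundException
    // case), the OutputStream opened for the new claim is never closed. The OS
    // keeps the descriptor alive even after the file is later unlinked.
    static void writeLeaky(String inPath, String outPath) throws IOException {
        OutputStream out = new FileOutputStream(outPath); // handle opened first
        InputStream in = new FileInputStream(inPath);     // may throw -> 'out' leaks
        copy(in, out);
        in.close();
        out.close();
    }

    // Safe variant: try-with-resources guarantees both handles are closed on
    // every exit path, so a failed read cannot strand the output descriptor.
    static void writeSafe(String inPath, String outPath) throws IOException {
        try (OutputStream out = new FileOutputStream(outPath);
             InputStream in = new FileInputStream(inPath)) {
            copy(in, out);
        }
    }

    private static void copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
    }
}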
Re: NiFi 0.5.1 "too many open files"
I found something in the logs on the nodes where I had a problem. A ContentNotFoundException begins occurring on these nodes, and after many thousands of occurrences we eventually get "too many open files". Once I do surgery on the content repository so that the ContentNotFoundException stops happening, lsof tells me that I'm no longer 'leaking' (deleted) file descriptors. I have no idea why the ContentNotFoundException starts happening, but it sounds similar to Tony Kurc's "heisenbug" [1] reported back in March.

Here's a stack trace of the ContentNotFoundException:

2016-04-27 12:00:00,00 ERROR [Timer-Driven Process Thread-1] MyCustomProcessor
org.apache.nifi.processor.exception.MissingFlowFileException: Unable to find content for FlowFile
    at o.a.n.controller.repository.StandardProcessSession.handleContentNotFound(StandardProcessSession.java:2349)
    at o.a.n.controller.repository.StandardProcessSession.write(StandardProcessSession.java:2186)
    at MyCustomProcessor.onTrigger()
    at o.a.n.processor.AbstractProcessor.onTrigger(AbstractProcessor.java:27)
Caused by: o.a.n.controller.repository.ContentNotFoundException: Could not find content for StandardContentClaim [resourceClaim=StandardResourceClaim[id=123456789-1, container=default, section=1], offset=351144, length=30720]
    at o.a.n.controller.repository.FileSystemRepository.getPath(FileSystemRepository.java:492)
    at o.a.n.controller.repository.FileSystemRepository.read(FileSystemRepository.java:801)
    at o.a.n.controller.repository.StandardProcessSession.getInputStream(StandardProcessSession.java:1743)
    at o.a.n.controller.repository.StandardProcessSession.write(StandardProcessSession.java:2152)
    ... common frames omitted

MyCustomProcessor just calls session.write() with a StreamCallback implementation that reads from an InputStream and writes to an OutputStream.

[1] - http://apache-nifi-developer-list.39713.n7.nabble.com/heisenbug-causing-quot-lost-quot-content-claims-td7775.html

-- Mike

On Wed, Apr 27, 2016 at 12:15 PM, Joe Witt wrote:
> [snip]
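[For context, here is a minimal sketch of the session.write()/StreamCallback pattern Mike describes. The processor is a hypothetical stand-in for MyCustomProcessor; the relationship name and the pass-through copy are assumptions, and only the write-with-callback call itself mirrors the report.]

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Collections;
import java.util.Set;

import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.AbstractProcessor;
import org.apache.nifi.processor.ProcessContext;
import org.apache.nifi.processor.ProcessSession;
import org.apache.nifi.processor.Relationship;
import org.apache.nifi.processor.exception.ProcessException;
import org.apache.nifi.processor.io.StreamCallback;

// Hypothetical stand-in for "MyCustomProcessor": reads the FlowFile's content
// and writes a transformed copy back through a single session.write() call.
public class MyCustomProcessor extends AbstractProcessor {

    static final Relationship REL_SUCCESS = new Relationship.Builder()
            .name("success").build();

    @Override
    public Set<Relationship> getRelationships() {
        return Collections.singleton(REL_SUCCESS);
    }

    @Override
    public void onTrigger(ProcessContext context, ProcessSession session)
            throws ProcessException {
        FlowFile flowFile = session.get();
        if (flowFile == null) {
            return;
        }

        // This is the call that surfaces MissingFlowFileException when the
        // underlying content claim cannot be found (ContentNotFoundException).
        flowFile = session.write(flowFile, new StreamCallback() {
            @Override
            public void process(InputStream in, OutputStream out) throws IOException {
                byte[] buffer = new byte[8192];
                int n;
                while ((n = in.read(buffer)) != -1) {
                    out.write(buffer, 0, n); // transform bytes here as needed
                }
            }
        });

        session.transfer(flowFile, REL_SUCCESS);
    }
}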
Re: NiFi 0.5.1 "too many open files"
Mike,

OK, that is a good data point. In my case they are all in the archive, but I agree that isn't super meaningful, because in reality nothing should ever be open for writing in the archive.

If you can, and you have enough logging on, try searching for the first part of the filename in your logs. Let's see if we can narrow this down to a particular type of flow or something.

Thanks
Joe

On Wed, Apr 27, 2016 at 12:04 PM, Michael Moser wrote:
> [snip]
Re: NiFi 0.5.1 "too many open files"
Another data point ... we had archiving turned on at first, and then most (but not all) of the files that lsof reported were

/content_repository/0/archive/123456789-123456 (deleted)

We turned archiving off, hoping that it was related in some way, but it was not.

-- Mike

On Wed, Apr 27, 2016 at 11:53 AM, Joe Witt wrote:
> [snip]
Re: NiFi 0.5.1 "too many open files"
Mike,

Definitely does not sound familiar. However, I just looked up what you describe and I do see it. In my case there are only three files, but they are sitting there open for writing by the nifi process and yet have been deleted. So I do believe there is an issue... will dig in a bit, but obviously if you make more discoveries here, please share.

Thanks
Joe

On Wed, Apr 27, 2016 at 11:31 AM, Michael Moser wrote:
> Devs,
>
> We recently upgraded from NiFi 0.4.1 to 0.5.1 on a cluster. We noticed
> half of our cluster nodes getting "too many open files" errors that
> require a NiFi restart, while the other half works without this problem.
> Using 'lsof -p <pid>' to identify the open file descriptors at the time
> of the problem, we see that most of the file descriptors reference
> deleted files in the content repository, like this:
>
> java... /content_repository/81/123456789-123456 (deleted)
>
> An 'ls /content_repository/81/123456789-123456' confirms that the file
> has been deleted.
>
> We are continuing our investigation into why some of our nodes have a
> problem but others don't. Has anyone else seen this? Did anything change
> between 0.4.1 and 0.5.1 related to deleting files from the content
> repository?
>
> Regards,
> -- Mike