Re: NiFi 0.5.1 "too many open files"

2016-04-28 Thread Mark Payne
Mike,

I am looking into this, and I was able to figure out how we could potentially 
archive (and eventually
age off from the archive) a piece of data for which there is still an open file 
handle. Specifically,
I am able to see how this could happen, but only when we have a problem
reading a FlowFile (such as a ContentNotFoundException) during a call to
ProcessSession.write() with a StreamCallback.

So exactly the situation that you laid out here. I will be looking into how to 
address this.
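
To illustrate the mechanism in generic POSIX terms (this is not NiFi code; the file name is a throwaway temp file): an unlinked file's content stays reachable, and its descriptor stays open, until the last close() — which is exactly how an aged-off archive file can linger as an open handle:

```shell
# Illustration only: the kernel defers freeing an unlinked file's inode
# while any descriptor remains open; only close() releases it.
f=$(mktemp)
printf 'claim-bytes\n' > "$f"
exec 8<"$f"        # open a read descriptor, as a reader of a content claim would
rm "$f"            # "age off" the file while the handle is still open
read -r data <&8   # the content is still reachable through the open fd
echo "still readable after unlink: $data"
exec 8<&-          # only now is the descriptor (and the inode) released
```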

Do you have any idea what may have caused your ContentNotFoundException to 
occur in the first
place?

Thanks
-Mark


> On Apr 27, 2016, at 5:24 PM, Michael Moser  wrote:
> 
> I found something in the logs on the nodes where I had a problem.  A
> ContentNotFoundException begins occurring on these nodes and after many
> thousands of times we eventually get "too many open files".  Once I do
> surgery on the content repository so that ContentNotFoundException stops
> happening, then lsof tells me that I'm not 'leaking' any more (deleted)
> file descriptors.  I have no idea why the ContentNotFoundException starts
> happening, but it sounds similar to Tony Kurc's "heisenbug" [1] reported
> back in March.
> 
> Here's a stack trace of the ContentNotFoundException
> 
> 2016-04-27 12:00:00,00 ERROR [Timer-Driven Process Thread-1]
> MyCustomProcessor
> org.apache.nifi.processor.exception.MissingFlowFileException: Unable to
> find content for FlowFile
>  at
> o.a.n.controller.repository.StandardProcessSession.handleContentNotFound(StandardProcessSession.java:2349)
>  at
> o.a.n.controller.repository.StandardProcessSession.write(StandardProcessSession.java:2186)
>  at MyCustomProcessor.onTrigger()
>  at o.a.n.processor.AbstractProcessor.onTrigger(AbstractProcessor.java:27)
> Caused by: o.a.n.controller.repository.ContentNotFoundException: Could not
> find content for StandardContentClaim
> [resourceClaim=StandardResourceClaim[id=123456789-1, container=default,
> section=1], offset=351144, length=30720]
>  at
> o.a.n.controller.repository.FileSystemRepository.getPath(FileSystemRepository.java:492)
>  at
> o.a.n.controller.repository.FileSystemRepository.read(FileSystemRepository.java:801)
>  at
> o.a.n.controller.repository.StandardProcessSession.getInputStream(StandardProcessSession.java:1743)
>  at
> o.a.n.controller.repository.StandardProcessSession.write(StandardProcessSession.java:2152)
>  ... common frames omitted
> 
> MyCustomProcessor just calls session.write() with a StreamCallback
> implementation that reads from an InputStream and writes to an OutputStream.
> 
> [1] -
> http://apache-nifi-developer-list.39713.n7.nabble.com/heisenbug-causing-quot-lost-quot-content-claims-td7775.html
> 
> -- Mike
> 
> 
> 
> 
> On Wed, Apr 27, 2016 at 12:15 PM, Joe Witt  wrote:
> 
>> Mike
>> 
>> Ok that is a good data point.  In my case they are all in archive but
>> I do agree that isn't super meaningful because in reality nothing
>> should ever be open for writing in the archive.
>> 
>> If you can and have enough logging on try searching for that first
>> part of the filename in your logs.  Let's see if we can narrow this to
>> a particular type of flow or something.
>> 
>> Thanks
>> Joe
>> 
>> On Wed, Apr 27, 2016 at 12:04 PM, Michael Moser 
>> wrote:
>>> Another data point ... we had archiving turned on at first, and then most
>>> (but not all) files that lsof reported were
>>> 
>>> /content_repository/0/archive/123456789-123456 (deleted).
>>> 
>>> We turned archiving off, hoping that was related in some way, but it was
>>> not.
>>> 
>>> -- Mike
>>> 
>>> 
>>> On Wed, Apr 27, 2016 at 11:53 AM, Joe Witt  wrote:
>>> 
>>>> Mike,
>>>> 
>>>> Definitely does not sound familiar.  However, just looked up what you
>>>> describe and I do see it.  In my case there are only three files but
>>>> they are sitting there open for writing by the nifi process and yet
>>>> have been deleted.  So I do believe there is an issue...will dig in a
>>>> bit but obviously if you make more discoveries here please share.
>>>> 
>>>> Thanks
>>>> Joe
>>>> 
>>>> 
>>>> 
>>>> On Wed, Apr 27, 2016 at 11:31 AM, Michael Moser 
>>>> wrote:
>>>>> Devs,
>>>>> 
>>>>> We recently upgraded from NiFi 0.4.1 to 0.5.1 on a cluster.  We noticed
>>>>> half of our cluster nodes getting "too many open files" errors that
>>>>> require a NiFi restart, while the other half works without this problem.
>>>>> Using 'lsof -p <pid>' to identify the open file descriptors at the time
>>>>> of the problem, we see most of the file descriptors reference deleted
>>>>> files in the content repository like this:
>>>>> 
>>>>> java... /content_repository/81/123456789-123456 (deleted)
>>>>> 
>>>>> A 'ls /content_repository/81/123456789-123456' confirms that the file
>>>>> has been deleted.
>>>>> 
>>>>> We are continuing our investigation into why some of our nodes have a
>>>>> problem but others don't.  Has anyone else seen this?  Did anything
>>>>> change between 0.4.1 and 0.5.1 related to deleting files from the
>>>>> content repository?
>>>>> 
>>>>> Regards,
>>>>> -- Mike
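
The lsof check described in the thread can be scripted. A minimal sketch, Linux-only (it reads /proc, which reports the same "(deleted)" annotation lsof does); the leaked descriptor here is self-inflicted for demonstration, and the pid is the current shell rather than a NiFi JVM:

```shell
# Count descriptors a process holds open on already-deleted files -- the
# same signal as 'lsof -p <pid>' showing "(deleted)" entries. Point pid
# at the NiFi JVM in practice; here we leak one fd on purpose.
pid=$$
tmp=$(mktemp)
exec 9<"$tmp"      # hold a descriptor open...
rm -f "$tmp"       # ...then delete the file: a "(deleted)" entry appears
leaked=$(ls -l "/proc/$pid/fd" 2>/dev/null | grep -c '(deleted)')
echo "deleted-but-open descriptors: $leaked"
exec 9<&-          # closing the fd is what finally releases the inode
```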
