Re: Open socket handles build up over time (leaking?)

2022-05-13 Thread Leon Finker
Hi,

I'm trying to find the other slow client socket leak. Is it possible
that the socket is sometimes not closed for this condition?
[warn  tid=324]
ClientHealthMonitor: Unregistering client with member id
...,connection=1,durableAttributes=DurableClientAttributes[id=xxx;
timeout=86400]) due to: The connection has been reset while reading
the header

It seems to come from Message.fetchHeader, but I don't know the usage
well enough to verify whether that socket is always closed when the
client side sends EOF while the server is reading the header from the
socket. Could someone help review? The count of that log warning seems
to correlate with the number of leaked socket handles.
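
For illustration only (this is not the actual geode code, and the names
below are made up), the kind of guarantee I'm trying to confirm looks
like this: if reading the header fails because the client reset the
connection, some path must still close the accepted socket.

import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.net.Socket;

class HeaderReadSketch {
  // Read the first header byte from an accepted client socket. If the client
  // has already reset/closed the connection, make sure the server-side socket
  // is closed before the failure is propagated.
  static int readFirstHeaderByte(Socket socket) throws IOException {
    try {
      InputStream in = socket.getInputStream();
      int b = in.read(); // returns -1 once the peer has closed the connection
      if (b == -1) {
        throw new EOFException("The connection has been reset while reading the header");
      }
      return b;
    } catch (IOException e) {
      try {
        socket.close(); // the guarantee in question: is a close like this reached on every path?
      } catch (IOException ignore) {
      }
      throw e;
    }
  }
}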

Thank you in advance

On Mon, Nov 22, 2021 at 1:45 PM Darrel Schneider  wrote:
>
> Thanks for filing the ticket and for looking into the additional leak
> 
> From: Leon Finker 
> Sent: Monday, November 22, 2021 10:19 AM
> To: dev@geode.apache.org 
> Subject: Re: Open socket handles build up over time (leaking?)
>
> Hi,
>
> Bug ticket: 
> https://issues.apache.org/jira/browse/GEODE-9819
>
> I applied the fix locally and now this leak is fixed. Thank you!
>
> There is still a minor leak of about 10 handles per day that I want to
> track down next week. From the looks of it it's not as simple to track
> as this one :( Still happens in the durable client connection use case
> and cache server port. Seems to be when an existing durable connection
> is disconnected and automatically reconnected. I need to do more
> testing first to know for sure.
>
> On Wed, Nov 17, 2021 at 4:22 PM Darrel Schneider  wrote:
> >
> > Yes, open a geode bug ticket
> > 
> > From: Leon Finker 
> > Sent: Wednesday, November 17, 2021 1:13 PM
> > To: dev@geode.apache.org 
> > Subject: Re: Open socket handles build up over time (leaking?)
> >
> > Thank you! I'll try to test this change. What's the procedure for
> > this? Should I open geode bug ticket?
> >
> > On Wed, Nov 17, 2021 at 2:22 PM Darrel Schneider  wrote:
> > >
> > > I think you found the leak!
> > > My understanding of the code in registerClientInternal (I'm looking at 
> > > the current develop branch) is that when it logs the warning "Duplicate 
> > > durable clients are not allowed" that it considers the current client 
> > > connect attempt to have failed. It writes this response back to it: 
> > > REPLY_EXCEPTION_DUPLICATE_DURABLE_CLIENT. This will cause the client to 
> > > throw ServerRefusedConnectionException. What seems wrong about this 
> > > method is that even though it sets "unsuccessfulMsg" and correctly sends 
> > > back a handshake saying the client is rejected, it does not throw an 
> > > exception and it does not close "socket". I think right before it calls 
> > > performPostAuthorization it should do the following:
> > > if (unsuccessfulMsg != null) {
> > >   try {
> > > socket.close();
> > >   } catch (IOException ignore) {
> > >   }
> > >  } else {
> > > performPostAuthorization(...)
> > > }
> > > 
> > > From: Leon Finker 
> > > Sent: Wednesday, November 17, 2021 10:30 AM
> > > To: dev@geode.apache.org 
> > > Subject: Re: Open socket handles build up over time (leaking?)
> > >
> > > Following Darrel's excellent advice :) I think I tracked down the area
> > > of the socket handle leak. As the article suggested, I ran the lsof
> > > capture every 5 minutes. I then traced back the cleaned up socket
> > > handles to the valid lsof entries. I verified a bunch of them and they
> > > all ended up pointing to the same durable connection cases that fail
> > > as follows:
> > >
> > > [2021-11-17 14:05:13,081Z][info  > > 17> tid=32115] :Cache server: Initializing secondary server-to-client
> > > communication socket: Socket[addr=/x.x.x.x,port=56090,localport=40011]
> > > [2021-11-17 14:05:13,083Z][warn  > > 17> tid=32115] The requested durable client has the same identifier (
> > > tpdemo2@x_gem_x ) as an existing durable client (
> > > CacheClientProxy[identity(TS02493(tpdemo2@x:loner):60759:9243

Re: Open socket handles build up over time (leaking?)

2021-12-03 Thread Leon Finker
Hi,

Unfortunately I'm not able to find the cause of the remaining minor
socket handle leak. I've tried tracing it for a week. It won't cause us
any major issues since it leaks about 10 handles per day on average.
For example, after 15 days there are only 160 socket handles leaked on
one cache peer and 120 on the other. It is definitely the cache port
and durable client connections; only those show up in the lsof
captures as problematic.

I captured lsof and enabled debug logging in geode to correlate ports
and maybe get some additional information for those connections, but
there is nothing in the log for them. I've correlated about 10 of these
CLOSE_WAIT and then abandoned sockets, and in all cases the very last
thing geode logs at debug for those ports is the following:

[2021-12-03 10:20:01,056Z][debug  tid=74] Checking whether to
ping 
CacheClientProxy[identity(192.168.66.7(xxx:loner):65188:000fae7f:xxx(version:GEODE
1.14.0),connection=1,durableAttributes=DurableClientAttributes[id=xxx_gem_Amoeba;
timeout=86400]); port=59084; primary=true; version=GEODE 1.14.0]

[2021-12-03 10:20:01,056Z][debug  tid=74] Not pinging because
not idle: 
CacheClientProxy[identity(192.168.66.7(xxx:loner):65188:000fae7f:xxx(version:GEODE
1.14.0),connection=1,durableAttributes=DurableClientAttributes[id=xx_gem_Amoeba;
timeout=86400]); port=59084; primary=true; version=GEODE 1.14.0]

(I replaced durable id details with xxx above.) After the above lines,
there is no longer any mention of port=59084 or :59084. Nothing after
hours or days. I can see the same user/durable id reconnect after some
time, but on a different port. The port that is lost never shows any
activity in the logs again.

I don't know how to debug it further. The other leak was easy compared
to this one, since there was an exception in the log and some activity
for the port, so it was easy to correlate.

But just to confirm: the fix for the other leak is working perfectly.
After 2 weeks, all good!

Thank you very much!

On Mon, Nov 22, 2021 at 1:45 PM Darrel Schneider  wrote:
>
> Thanks for filing the ticket and for looking into the additional leak
> 
> From: Leon Finker 
> Sent: Monday, November 22, 2021 10:19 AM
> To: dev@geode.apache.org 
> Subject: Re: Open socket handles build up over time (leaking?)
>
> Hi,
>
> Bug ticket: 
> https://issues.apache.org/jira/browse/GEODE-9819
>
> I applied the fix locally and now this leak is fixed. Thank you!
>
> There is still a minor leak of about 10 handles per day that I want to
> track down next week. From the looks of it it's not as simple to track
> as this one :( Still happens in the durable client connection use case
> and cache server port. Seems to be when an existing durable connection
> is disconnected and automatically reconnected. I need to do more
> testing first to know for sure.
>
> On Wed, Nov 17, 2021 at 4:22 PM Darrel Schneider  wrote:
> >
> > Yes, open a geode bug ticket
> > 
> > From: Leon Finker 
> > Sent: Wednesday, November 17, 2021 1:13 PM
> > To: dev@geode.apache.org 
> > Subject: Re: Open socket handles build up over time (leaking?)
> >
> > Thank you! I'll try to test this change. What's the procedure for
> > this? Should I open geode bug ticket?
> >
> > On Wed, Nov 17, 2021 at 2:22 PM Darrel Schneider  wrote:
> > >
> > > I think you found the leak!
> > > My understanding of the code in registerClientInternal (I'm looking at 
> > > the current develop branch) is that when it logs the warning "Duplicate 
> > > durable clients are not allowed" that it considers the current client 
> > > connect attempt to have failed. It writes this response back to it: 
> > > REPLY_EXCEPTION_DUPLICATE_DURABLE_CLIENT. This will cause the client to 
> > > throw ServerRefusedConnectionException. What seems wrong about this 
> > > method is that even though it sets "unsuccessfulMsg" and correctly sends 
> > > back a handshake saying the client is rejected, it does not throw an 
> > > exception and it does not close "socket". I think right before it calls 
> > > performPostAuthorization it should do the following:
> > > if (unsuccessfulMsg != null) {
> > >   try {
> > > socket.close();
> > >   } catch (IOException ignore) {
> > >   }
> > > 

Re: Open socket handles build up over time (leaking?)

2021-11-22 Thread Darrel Schneider
Thanks for filing the ticket and for looking into the additional leak

From: Leon Finker 
Sent: Monday, November 22, 2021 10:19 AM
To: dev@geode.apache.org 
Subject: Re: Open socket handles build up over time (leaking?)

Hi,

Bug ticket: 
https://issues.apache.org/jira/browse/GEODE-9819

I applied the fix locally and now this leak is fixed. Thank you!

There is still a minor leak of about 10 handles per day that I want to
track down next week. From the looks of it it's not as simple to track
as this one :( Still happens in the durable client connection use case
and cache server port. Seems to be when an existing durable connection
is disconnected and automatically reconnected. I need to do more
testing first to know for sure.

On Wed, Nov 17, 2021 at 4:22 PM Darrel Schneider  wrote:
>
> Yes, open a geode bug ticket
> 
> From: Leon Finker 
> Sent: Wednesday, November 17, 2021 1:13 PM
> To: dev@geode.apache.org 
> Subject: Re: Open socket handles build up over time (leaking?)
>
> Thank you! I'll try to test this change. What's the procedure for
> this? Should I open geode bug ticket?
>
> On Wed, Nov 17, 2021 at 2:22 PM Darrel Schneider  wrote:
> >
> > I think you found the leak!
> > My understanding of the code in registerClientInternal (I'm looking at the 
> > current develop branch) is that when it logs the warning "Duplicate durable 
> > clients are not allowed" that it considers the current client connect 
> > attempt to have failed. It writes this response back to it: 
> > REPLY_EXCEPTION_DUPLICATE_DURABLE_CLIENT. This will cause the client to 
> > throw ServerRefusedConnectionException. What seems wrong about this method 
> > is that even though it sets "unsuccessfulMsg" and correctly sends back a 
> > handshake saying the client is rejected, it does not throw an exception and 
> > it does not close "socket". I think right before it calls 
> > performPostAuthorization it should do the following:
> > if (unsuccessfulMsg != null) {
> >   try {
> > socket.close();
> >   } catch (IOException ignore) {
> >   }
> >  } else {
> > performPostAuthorization(...)
> > }
> > ____
> > From: Leon Finker 
> > Sent: Wednesday, November 17, 2021 10:30 AM
> > To: dev@geode.apache.org 
> > Subject: Re: Open socket handles build up over time (leaking?)
> >
> > Following Darrel's excellent advice :) I think I tracked down the area
> > of the socket handle leak. As the article suggested, I ran the lsof
> > capture every 5 minutes. I then traced back the cleaned up socket
> > handles to the valid lsof entries. I verified a bunch of them and they
> > all ended up pointing to the same durable connection cases that fail
> > as follows:
> >
> > [2021-11-17 14:05:13,081Z][info  > 17> tid=32115] :Cache server: Initializing secondary server-to-client
> > communication socket: Socket[addr=/x.x.x.x,port=56090,localport=40011]
> > [2021-11-17 14:05:13,083Z][warn  > 17> tid=32115] The requested durable client has the same identifier (
> > tpdemo2@x_gem_x ) as an existing durable client (
> > CacheClientProxy[identity(TS02493(tpdemo2@x:loner):60759:9243242e:tpdemo2@x(version:GEODE
> > 1.14.0),connection=1,durableAttributes=DurableClientAttributes[id=tpdemo2@TS02493@21.5.3@8.2.6@2021.5.8_gem_Amoeba;
> > timeout=86400]); port=53976; primary=true; version=GEODE 1.14.0] ).
> > Duplicate durable clients are not allowed.
> > [2021-11-17 14:05:13,084Z][warn  > 17> tid=32115] CacheClientNotifier: Unsuccessfully registered client
> > with identifier
> > identity(TS02493(tpdemo2@x:loner):60759:9243242e:tpdemo2@x(version:GEODE
> > 1.14.0),connection=1,durableAttributes=DurableClientAttributes[id=tpdemo2@TS02493@21.5.3@8.2.6@2021.5.8_gem_Amoeba;
> > timeout=86400]) and response code 64
> > [2021-11-17 14:05:13,098Z][warn  > 34> tid=290] Server connection from
> > [identity(TS02493(tpdemo2@x:loner):60759:9243242e:tpdemo2@x(version:GEODE
> > 1.14.0),connection=1,durableAttributes=DurableClientAttributes[id=tpdemo2@TS02493@21.5.3@8.2.6@2021.5.8_gem_Amoeba;
> > timeout=86400]); port=60242]: connection disconnect detected by EOF.
> >
> > I then counted the number of those errors and found t

Re: Open socket handles build up over time (leaking?)

2021-11-22 Thread Leon Finker
Hi,

Bug ticket: https://issues.apache.org/jira/browse/GEODE-9819

I applied the fix locally and now this leak is fixed. Thank you!

There is still a minor leak of about 10 handles per day that I want to
track down next week. From the looks of it, it's not as simple to track
down as this one :( It still happens in the durable client connection
use case and on the cache server port. It seems to be when an existing
durable connection is disconnected and automatically reconnected. I
need to do more testing first to know for sure.

On Wed, Nov 17, 2021 at 4:22 PM Darrel Schneider  wrote:
>
> Yes, open a geode bug ticket
> 
> From: Leon Finker 
> Sent: Wednesday, November 17, 2021 1:13 PM
> To: dev@geode.apache.org 
> Subject: Re: Open socket handles build up over time (leaking?)
>
> Thank you! I'll try to test this change. What's the procedure for
> this? Should I open geode bug ticket?
>
> On Wed, Nov 17, 2021 at 2:22 PM Darrel Schneider  wrote:
> >
> > I think you found the leak!
> > My understanding of the code in registerClientInternal (I'm looking at the 
> > current develop branch) is that when it logs the warning "Duplicate durable 
> > clients are not allowed" that it considers the current client connect 
> > attempt to have failed. It writes this response back to it: 
> > REPLY_EXCEPTION_DUPLICATE_DURABLE_CLIENT. This will cause the client to 
> > throw ServerRefusedConnectionException. What seems wrong about this method 
> > is that even though it sets "unsuccessfulMsg" and correctly sends back a 
> > handshake saying the client is rejected, it does not throw an exception and 
> > it does not close "socket". I think right before it calls 
> > performPostAuthorization it should do the following:
> > if (unsuccessfulMsg != null) {
> >   try {
> > socket.close();
> >   } catch (IOException ignore) {
> >   }
> >  } else {
> > performPostAuthorization(...)
> > }
> > ____
> > From: Leon Finker 
> > Sent: Wednesday, November 17, 2021 10:30 AM
> > To: dev@geode.apache.org 
> > Subject: Re: Open socket handles build up over time (leaking?)
> >
> > Following Darrel's excellent advice :) I think I tracked down the area
> > of the socket handle leak. As the article suggested, I ran the lsof
> > capture every 5 minutes. I then traced back the cleaned up socket
> > handles to the valid lsof entries. I verified a bunch of them and they
> > all ended up pointing to the same durable connection cases that fail
> > as follows:
> >
> > [2021-11-17 14:05:13,081Z][info  > 17> tid=32115] :Cache server: Initializing secondary server-to-client
> > communication socket: Socket[addr=/x.x.x.x,port=56090,localport=40011]
> > [2021-11-17 14:05:13,083Z][warn  > 17> tid=32115] The requested durable client has the same identifier (
> > tpdemo2@x_gem_x ) as an existing durable client (
> > CacheClientProxy[identity(TS02493(tpdemo2@x:loner):60759:9243242e:tpdemo2@x(version:GEODE
> > 1.14.0),connection=1,durableAttributes=DurableClientAttributes[id=tpdemo2@TS02493@21.5.3@8.2.6@2021.5.8_gem_Amoeba;
> > timeout=86400]); port=53976; primary=true; version=GEODE 1.14.0] ).
> > Duplicate durable clients are not allowed.
> > [2021-11-17 14:05:13,084Z][warn  > 17> tid=32115] CacheClientNotifier: Unsuccessfully registered client
> > with identifier
> > identity(TS02493(tpdemo2@x:loner):60759:9243242e:tpdemo2@x(version:GEODE
> > 1.14.0),connection=1,durableAttributes=DurableClientAttributes[id=tpdemo2@TS02493@21.5.3@8.2.6@2021.5.8_gem_Amoeba;
> > timeout=86400]) and response code 64
> > [2021-11-17 14:05:13,098Z][warn  > 34> tid=290] Server connection from
> > [identity(TS02493(tpdemo2@x:loner):60759:9243242e:tpdemo2@x(version:GEODE
> > 1.14.0),connection=1,durableAttributes=DurableClientAttributes[id=tpdemo2@TS02493@21.5.3@8.2.6@2021.5.8_gem_Amoeba;
> > timeout=86400]); port=60242]: connection disconnect detected by EOF.
> >
> > I then counted the number of those errors and found that the
> > difference is exactly the same as the leaked socket handles. So
> > everything points to this area in the server code. But I tried to
> > reproduce this in debugger and stepped through the geode code without
> > finding where the socket might be leaked. Put breakpoint on
> > CacheClientNotifier.registerClientInternal and the line where that
> > error happens. But then again the exception seems to bubble up to
> > AcceptorImpl.run where there is a catch for IOException. And in case of
> > IOException, the client socket is closed. Is this the right so

Re: Open socket handles build up over time (leaking?)

2021-11-17 Thread Darrel Schneider
Yes, open a geode bug ticket

From: Leon Finker 
Sent: Wednesday, November 17, 2021 1:13 PM
To: dev@geode.apache.org 
Subject: Re: Open socket handles build up over time (leaking?)

Thank you! I'll try to test this change. What's the procedure for
this? Should I open geode bug ticket?

On Wed, Nov 17, 2021 at 2:22 PM Darrel Schneider  wrote:
>
> I think you found the leak!
> My understanding of the code in registerClientInternal (I'm looking at the 
> current develop branch) is that when it logs the warning "Duplicate durable 
> clients are not allowed" that it considers the current client connect attempt 
> to have failed. It writes this response back to it: 
> REPLY_EXCEPTION_DUPLICATE_DURABLE_CLIENT. This will cause the client to throw 
> ServerRefusedConnectionException. What seems wrong about this method is that 
> even though it sets "unsuccessfulMsg" and correctly sends back a handshake 
> saying the client is rejected, it does not throw an exception and it does not 
> close "socket". I think right before it calls performPostAuthorization it 
> should do the following:
> if (unsuccessfulMsg != null) {
>   try {
> socket.close();
>   } catch (IOException ignore) {
>   }
>  } else {
> performPostAuthorization(...)
> }
> 
> From: Leon Finker 
> Sent: Wednesday, November 17, 2021 10:30 AM
> To: dev@geode.apache.org 
> Subject: Re: Open socket handles build up over time (leaking?)
>
> Following Darrel's excellent advice :) I think I tracked down the area
> of the socket handle leak. As the article suggested, I ran the lsof
> capture every 5 minutes. I then traced back the cleaned up socket
> handles to the valid lsof entries. I verified a bunch of them and they
> all ended up pointing to the same durable connection cases that fail
> as follows:
>
> [2021-11-17 14:05:13,081Z][info  17> tid=32115] :Cache server: Initializing secondary server-to-client
> communication socket: Socket[addr=/x.x.x.x,port=56090,localport=40011]
> [2021-11-17 14:05:13,083Z][warn  17> tid=32115] The requested durable client has the same identifier (
> tpdemo2@x_gem_x ) as an existing durable client (
> CacheClientProxy[identity(TS02493(tpdemo2@x:loner):60759:9243242e:tpdemo2@x(version:GEODE
> 1.14.0),connection=1,durableAttributes=DurableClientAttributes[id=tpdemo2@TS02493@21.5.3@8.2.6@2021.5.8_gem_Amoeba;
> timeout=86400]); port=53976; primary=true; version=GEODE 1.14.0] ).
> Duplicate durable clients are not allowed.
> [2021-11-17 14:05:13,084Z][warn  17> tid=32115] CacheClientNotifier: Unsuccessfully registered client
> with identifier
> identity(TS02493(tpdemo2@x:loner):60759:9243242e:tpdemo2@x(version:GEODE
> 1.14.0),connection=1,durableAttributes=DurableClientAttributes[id=tpdemo2@TS02493@21.5.3@8.2.6@2021.5.8_gem_Amoeba;
> timeout=86400]) and response code 64
> [2021-11-17 14:05:13,098Z][warn  34> tid=290] Server connection from
> [identity(TS02493(tpdemo2@x:loner):60759:9243242e:tpdemo2@x(version:GEODE
> 1.14.0),connection=1,durableAttributes=DurableClientAttributes[id=tpdemo2@TS02493@21.5.3@8.2.6@2021.5.8_gem_Amoeba;
> timeout=86400]); port=60242]: connection disconnect detected by EOF.
>
> I then counted the number of those errors and found that the
> difference is exactly the same as the leaked socket handles. So
> everything points to this area in the server code. But I tried to
> reproduce this in debugger and stepped through the geode code without
> finding where the socket might be leaked. Put breakpoint on
> CacheClientNotifier.registerClientInternal and the line where that
> error happens. But then again the exception seems to bubble up to
> AcceptorImpl.run where there is a catch for IOException. And in case of
> IOException, the client socket is closed. Is this the right socket I'm
> after?
>
> Does this make it any clearer for someone who is more familiar with
> the code? NOTE: the duplicate durable client happens sometimes due to
> race conditions with reconnects (I think). It's not happening all the
> time. It's not causing any problems to the client in general. But the
> server leak eventually causes us to run out of file handles for the
> process :(
>
> Thank you!
>
>
> On Tue, Nov 16, 2021 at 4:33 PM Leon Finker  wrote:
> >
> > Hi Darrel,
> >
> > Thank you! I'll try to track it!
> >
> > On Tue, Nov 16, 2021 at 2:42 PM Darrel Schneider  wrote:
> > >
> > > This link: 
> > > https://help.mulesoft.com/s/article/How-to-identify-leaked-file-descriptors-that-are-shown-only-as-cant-identify-protocol-in-lsof

Re: Open socket handles build up over time (leaking?)

2021-11-17 Thread Leon Finker
Thank you! I'll try to test this change. What's the procedure for
this? Should I open geode bug ticket?

On Wed, Nov 17, 2021 at 2:22 PM Darrel Schneider  wrote:
>
> I think you found the leak!
> My understanding of the code in registerClientInternal (I'm looking at the 
> current develop branch) is that when it logs the warning "Duplicate durable 
> clients are not allowed" that it considers the current client connect attempt 
> to have failed. It writes this response back to it: 
> REPLY_EXCEPTION_DUPLICATE_DURABLE_CLIENT. This will cause the client to throw 
> ServerRefusedConnectionException. What seems wrong about this method is that 
> even though it sets "unsuccessfulMsg" and correctly sends back a handshake 
> saying the client is rejected, it does not throw an exception and it does not 
> close "socket". I think right before it calls performPostAuthorization it 
> should do the following:
> if (unsuccessfulMsg != null) {
>   try {
> socket.close();
>   } catch (IOException ignore) {
>   }
>  } else {
> performPostAuthorization(...)
> }
> 
> From: Leon Finker 
> Sent: Wednesday, November 17, 2021 10:30 AM
> To: dev@geode.apache.org 
> Subject: Re: Open socket handles build up over time (leaking?)
>
> Following Darrel's excellent advice :) I think I tracked down the area
> of the socket handle leak. As the article suggested, I ran the lsof
> capture every 5 minutes. I then traced back the cleaned up socket
> handles to the valid lsof entries. I verified a bunch of them and they
> all ended up pointing to the same durable connection cases that fail
> as follows:
>
> [2021-11-17 14:05:13,081Z][info  17> tid=32115] :Cache server: Initializing secondary server-to-client
> communication socket: Socket[addr=/x.x.x.x,port=56090,localport=40011]
> [2021-11-17 14:05:13,083Z][warn  17> tid=32115] The requested durable client has the same identifier (
> tpdemo2@x_gem_x ) as an existing durable client (
> CacheClientProxy[identity(TS02493(tpdemo2@x:loner):60759:9243242e:tpdemo2@x(version:GEODE
> 1.14.0),connection=1,durableAttributes=DurableClientAttributes[id=tpdemo2@TS02493@21.5.3@8.2.6@2021.5.8_gem_Amoeba;
> timeout=86400]); port=53976; primary=true; version=GEODE 1.14.0] ).
> Duplicate durable clients are not allowed.
> [2021-11-17 14:05:13,084Z][warn  17> tid=32115] CacheClientNotifier: Unsuccessfully registered client
> with identifier
> identity(TS02493(tpdemo2@x:loner):60759:9243242e:tpdemo2@x(version:GEODE
> 1.14.0),connection=1,durableAttributes=DurableClientAttributes[id=tpdemo2@TS02493@21.5.3@8.2.6@2021.5.8_gem_Amoeba;
> timeout=86400]) and response code 64
> [2021-11-17 14:05:13,098Z][warn  34> tid=290] Server connection from
> [identity(TS02493(tpdemo2@x:loner):60759:9243242e:tpdemo2@x(version:GEODE
> 1.14.0),connection=1,durableAttributes=DurableClientAttributes[id=tpdemo2@TS02493@21.5.3@8.2.6@2021.5.8_gem_Amoeba;
> timeout=86400]); port=60242]: connection disconnect detected by EOF.
>
> I then counted the number of those errors and found that the
> difference is exactly the same as the leaked socket handles. So
> everything points to this area in the server code. But I tried to
> reproduce this in debugger and stepped through the geode code without
> finding where the socket might be leaked. Put breakpoint on
> CacheClientNotifier.registerClientInternal and the line where that
> error happens. But then again the exception seems to bubble up to
> AcceptorImpl.run where there is a catch for IOException. And in case of
> IOException, the client socket is closed. Is this the right socket I'm
> after?
>
> Does this make it any clearer for someone who is more familiar with
> the code? NOTE: the duplicate durable client happens sometimes due to
> race conditions with reconnects (I think). It's not happening all the
> time. It's not causing any problems to the client in general. But the
> server leak eventually causes us to run out of file handles for the
> process :(
>
> Thank you!
>
>
> On Tue, Nov 16, 2021 at 4:33 PM Leon Finker  wrote:
> >
> > Hi Darrel,
> >
> > Thank you! I'll try to track it!
> >
> > On Tue, Nov 16, 2021 at 2:42 PM Darrel Schneider  wrote:
> > >
> > > This link: 
> > > https://help.mulesoft.com/s/article/How-to-identify-leaked-file-descriptors-that-are-shown-only-as-cant-identify-protocol-in-lsof

Re: Open socket handles build up over time (leaking?)

2021-11-17 Thread Leon Finker
Yes, we're definitely not sharing a durable ID between users. The
durable ID contains the login id@host and some other attributes. Even
on the same machine, if a user tries to run a 2nd instance of the
client application, a unique ID is generated and appended to the
durable ID.
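
Roughly, the client side builds and sets the id like this (a simplified
sketch using geode's durable-client properties, not our exact production
code; the locator address and the instance detection are placeholders):

import java.util.UUID;

import org.apache.geode.cache.client.ClientCache;
import org.apache.geode.cache.client.ClientCacheFactory;

class DurableClientIdExample {
  public static void main(String[] args) {
    // Stable base: login id + host (the "_gem_Amoeba" style suffix seen in the logs).
    String base = System.getProperty("user.name") + "@host1_gem_Amoeba"; // illustrative only
    // A second instance on the same machine gets a unique suffix appended.
    String durableId = isSecondInstance() ? base + "_" + UUID.randomUUID() : base;

    ClientCache cache = new ClientCacheFactory()
        .set("durable-client-id", durableId)
        .set("durable-client-timeout", "86400")      // seconds, matching the logs above
        .addPoolLocator("locator-host", 10334)       // placeholder locator address
        .setPoolSubscriptionEnabled(true)
        .create();

    cache.readyForEvents();  // start delivery of events queued for this durable id
    // ... application work ...
    cache.close(true);       // keepAlive = true so the server keeps the durable queue
  }

  private static boolean isSecondInstance() {
    return false;            // instance detection is application-specific; stubbed here
  }
}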

It seems the durable session cleans up by itself on the server side,
and then the same durable id can connect again as if nothing happened.
Very mysterious with the "due to: Unknown reason"...

[2021-11-17 14:08:25,614Z][warn  tid=216]
ClientHealthMonitor: Unregistering client with member id
identity(x(tpdemo2@x:loner):60759:9243242e:tpdemo2@x(version:GEODE
1.14.0),connection=1,durableAttributes=DurableClientAttributes[id=tpdemo2@x_gem_x;
timeout=86400]) due to: Unknown reason
[2021-11-17 14:08:25,614Z][info  tid=30130] available ids = 5 , isEmptyAckList
=true, peekInitialized = true
[2021-11-17 14:08:25,614Z][info  tid=30130]
CacheClientProxy[identity(x(tpdemo2@x:loner):60759:9243242e:tpdemo2@x(version:GEODE
1.14.0),connection=1,durableAttributes=DurableClientAttributes[id=tpdemo2@x_gem_x;
timeout=86400]); port=53976; primary=true; version=GEODE 1.14.0] :
Pausing processing
[2021-11-17 14:08:25,629Z][info  tid=216]
CacheClientNotifier: Keeping proxy for durable client named
tpdemo2@x_gem_x for 86400 seconds
CacheClientProxy[identity(x(tpdemo2@x:loner):60759:9243242e:tpdemo2@x(version:GEODE
1.14.0),connection=1,durableAttributes=DurableClientAttributes[id=tpdemo2@x_gem_x;
timeout=86400]); port=53976; primary=true; version=GEODE 1.14.0].
[2021-11-17 14:08:25,637Z][warn  tid=216]
Monitoring client with member id
identity(x(tpdemo2@x:loner):60759:9243242e:tpdemo2@x(version:GEODE
1.14.0),connection=1,durableAttributes=DurableClientAttributes[id=tpdemo2@x_gem_x;
timeout=86400]). It had been 60310 ms since the latest heartbeat. Max
interval is 6. Terminated client.




On Wed, Nov 17, 2021 at 3:07 PM Anthony Baker  wrote:
>
> I’m curious - how do you assign a durable client id?  The id should be unique 
> to a client and not shared among a pool of clients.
>
> Anthony
>
>
> > On Nov 17, 2021, at 11:22 AM, Darrel Schneider  wrote:
> >
> > I think you found the leak!
> > My understanding of the code in registerClientInternal (I'm looking at the 
> > current develop branch) is that when it logs the warning "Duplicate durable 
> > clients are not allowed" that it considers the current client connect 
> > attempt to have failed. It writes this response back to it: 
> > REPLY_EXCEPTION_DUPLICATE_DURABLE_CLIENT. This will cause the client to 
> > throw ServerRefusedConnectionException. What seems wrong about this method 
> > is that even though it sets "unsuccessfulMsg" and correctly sends back a 
> > handshake saying the client is rejected, it does not throw an exception and 
> > it does not close "socket". I think right before it calls 
> > performPostAuthorization it should do the following:
> > if (unsuccessfulMsg != null) {
> >  try {
> >socket.close();
> >  } catch (IOException ignore) {
> >  }
> > } else {
> >performPostAuthorization(...)
> > }
> > ____
> > From: Leon Finker 
> > Sent: Wednesday, November 17, 2021 10:30 AM
> > To: dev@geode.apache.org 
> > Subject: Re: Open socket handles build up over time (leaking?)
> >
> > Following Darrel's excellent advice :) I think I tracked down the area
> > of the socket handle leak. As the article suggested, I ran the lsof
> > capture every 5 minutes. I then traced back the cleaned up socket
> > handles to the valid lsof entries. I verified a bunch of them and they
> > all ended up pointing to the same durable connection cases that fail
> > as follows:
> >
> > [2021-11-17 14:05:13,081Z][info  > 17> tid=32115] :Cache server: Initializing secondary server-to-client
> > communication socket: Socket[addr=/x.x.x.x,port=56090,localport=40011]
> > [2021-11-17 14:05:13,083Z][warn  > 17> tid=32115] The requested durable client has the same identifier (
> > tpdemo2@x_gem_x ) as an existing durable client (
> > CacheClientProxy[identity(TS02493(tpdemo2@x:loner):60759:9243242e:tpdemo2@x(version:GEODE
> > 1.14.0),connection=1,durableAttributes=DurableClientAttributes[id=tpdemo2@TS02493@21.5.3@8.2.6@2021.5.8_gem_Amoeba;
> > timeout=86400]); port=53976; primary=true; version=GEODE 1.14.0] ).
> > Duplicate durable clients are not allowed.
> > [2021-11-17 14:05:13,084Z][warn  > 17> tid=32115] CacheClientNotifier: Unsuccessfully registered client
> > with identifier
> > identity(TS02493(tpdemo2@x:loner):60759:9243242e:tpdemo2@x(version:GEODE
> > 1.14.0),connection=1,durableAttributes=DurableClientAttributes[id=tpdemo2

Re: Open socket handles build up over time (leaking?)

2021-11-17 Thread Anthony Baker
I’m curious - how do you assign a durable client id?  The id should be unique 
to a client and not shared among a pool of clients. 

Anthony


> On Nov 17, 2021, at 11:22 AM, Darrel Schneider  wrote:
> 
> I think you found the leak!
> My understanding of the code in registerClientInternal (I'm looking at the 
> current develop branch) is that when it logs the warning "Duplicate durable 
> clients are not allowed" that it considers the current client connect attempt 
> to have failed. It writes this response back to it: 
> REPLY_EXCEPTION_DUPLICATE_DURABLE_CLIENT. This will cause the client to throw 
> ServerRefusedConnectionException. What seems wrong about this method is that 
> even though it sets "unsuccessfulMsg" and correctly sends back a handshake 
> saying the client is rejected, it does not throw an exception and it does not 
> close "socket". I think right before it calls performPostAuthorization it 
> should do the following:
> if (unsuccessfulMsg != null) {
>  try {
>socket.close();
>  } catch (IOException ignore) {
>  }
> } else {
>performPostAuthorization(...)
> }
> 
> From: Leon Finker 
> Sent: Wednesday, November 17, 2021 10:30 AM
> To: dev@geode.apache.org 
> Subject: Re: Open socket handles build up over time (leaking?)
> 
> Following Darrel's excellent advice :) I think I tracked down the area
> of the socket handle leak. As the article suggested, I ran the lsof
> capture every 5 minutes. I then traced back the cleaned up socket
> handles to the valid lsof entries. I verified a bunch of them and they
> all ended up pointing to the same durable connection cases that fail
> as follows:
> 
> [2021-11-17 14:05:13,081Z][info  17> tid=32115] :Cache server: Initializing secondary server-to-client
> communication socket: Socket[addr=/x.x.x.x,port=56090,localport=40011]
> [2021-11-17 14:05:13,083Z][warn  17> tid=32115] The requested durable client has the same identifier (
> tpdemo2@x_gem_x ) as an existing durable client (
> CacheClientProxy[identity(TS02493(tpdemo2@x:loner):60759:9243242e:tpdemo2@x(version:GEODE
> 1.14.0),connection=1,durableAttributes=DurableClientAttributes[id=tpdemo2@TS02493@21.5.3@8.2.6@2021.5.8_gem_Amoeba;
> timeout=86400]); port=53976; primary=true; version=GEODE 1.14.0] ).
> Duplicate durable clients are not allowed.
> [2021-11-17 14:05:13,084Z][warn  17> tid=32115] CacheClientNotifier: Unsuccessfully registered client
> with identifier
> identity(TS02493(tpdemo2@x:loner):60759:9243242e:tpdemo2@x(version:GEODE
> 1.14.0),connection=1,durableAttributes=DurableClientAttributes[id=tpdemo2@TS02493@21.5.3@8.2.6@2021.5.8_gem_Amoeba;
> timeout=86400]) and response code 64
> [2021-11-17 14:05:13,098Z][warn  34> tid=290] Server connection from
> [identity(TS02493(tpdemo2@x:loner):60759:9243242e:tpdemo2@x(version:GEODE
> 1.14.0),connection=1,durableAttributes=DurableClientAttributes[id=tpdemo2@TS02493@21.5.3@8.2.6@2021.5.8_gem_Amoeba;
> timeout=86400]); port=60242]: connection disconnect detected by EOF.
> 
> I then counted the number of those errors and found that the
> difference is exactly the same as the leaked socket handles. So
> everything points to this area in the server code. But I tried to
> reproduce this in debugger and stepped through the geode code without
> finding where the socket might be leaked. Put breakpoint on
> CacheClientNotifier.registerClientInternal and the line where that
> error happens. But then again the exception seems to bubble up to
> AcceptorImpl.run where there is a catch for IOException. And in case of
> IOException, the client socket is closed. Is this the right socket I'm
> after?
> 
> Does this make it any clearer for someone who is more familiar with
> the code? NOTE: the duplicate durable client happens sometimes due to
> race conditions with reconnects (I think). It's not happening all the
> time. It's not causing any problems to the client in general. But the
> server leak eventually causes us to run out of file handles for the
> process :(
> 
> Thank you!
> 
> 
> On Tue, Nov 16, 2021 at 4:33 PM Leon Finker  wrote:
>> 
>> Hi Darrel,
>> 
>> Thank you! I'll try to track it!
>> 
>> On Tue, Nov 16, 2021 at 2:42 PM Darrel Schneider  wrote:
>>> 
>>> This link: 
>>> https://help.mulesoft.com/s/article/How-to-identify-leaked-file-descriptors-that-are-shown-only-as-cant-identify-protocol-in-lsof

Re: Open socket handles build up over time (leaking?)

2021-11-17 Thread Darrel Schneider
I think you found the leak!
My understanding of the code in registerClientInternal (I'm looking at the 
current develop branch) is that when it logs the warning "Duplicate durable 
clients are not allowed", it considers the current client connect attempt 
to have failed. It writes this response back to the client: 
REPLY_EXCEPTION_DUPLICATE_DURABLE_CLIENT. This will cause the client to throw 
ServerRefusedConnectionException. What seems wrong about this method is that 
even though it sets "unsuccessfulMsg" and correctly sends back a handshake 
saying the client is rejected, it does not throw an exception and it does not 
close "socket". I think right before it calls performPostAuthorization it 
should do the following:
if (unsuccessfulMsg != null) {
  try {
    socket.close();
  } catch (IOException ignore) {
  }
} else {
  performPostAuthorization(...)
}
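
To make the idea concrete, here is a rough standalone sketch of that
change (the method and helper names are placeholders, not the actual
registerClientInternal signature):

import java.io.IOException;
import java.net.Socket;

class RegisterClientSketch {
  void register(Socket socket, boolean duplicateDurableClient) {
    String unsuccessfulMsg = null;
    if (duplicateDurableClient) {
      unsuccessfulMsg = "Duplicate durable clients are not allowed.";
      // In geode this is the REPLY_EXCEPTION_DUPLICATE_DURABLE_CLIENT handshake.
      writeRejectionHandshake(socket, unsuccessfulMsg);
    }
    if (unsuccessfulMsg != null) {
      // Proposed addition: without this close, the method returns normally,
      // no exception ever reaches the acceptor's IOException handler, and
      // the accepted socket is never closed, which matches the observed leak.
      try {
        socket.close();
      } catch (IOException ignore) {
      }
    } else {
      performPostAuthorization(socket);
    }
  }

  private void writeRejectionHandshake(Socket socket, String msg) {
    // placeholder for sending the rejected-handshake response back to the client
  }

  private void performPostAuthorization(Socket socket) {
    // placeholder for the normal successful-registration path
  }
}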

From: Leon Finker 
Sent: Wednesday, November 17, 2021 10:30 AM
To: dev@geode.apache.org 
Subject: Re: Open socket handles build up over time (leaking?)

Following Darrel's excellent advice :) I think I tracked down the area
of the socket handle leak. As the article suggested, I ran the lsof
capture every 5 minutes. I then traced back the cleaned up socket
handles to the valid lsof entries. I verified a bunch of them and they
all ended up pointing to the same durable connection cases that fail
as follows:

[2021-11-17 14:05:13,081Z][info  tid=32115] :Cache server: Initializing secondary server-to-client
communication socket: Socket[addr=/x.x.x.x,port=56090,localport=40011]
[2021-11-17 14:05:13,083Z][warn  tid=32115] The requested durable client has the same identifier (
tpdemo2@x_gem_x ) as an existing durable client (
CacheClientProxy[identity(TS02493(tpdemo2@x:loner):60759:9243242e:tpdemo2@x(version:GEODE
1.14.0),connection=1,durableAttributes=DurableClientAttributes[id=tpdemo2@TS02493@21.5.3@8.2.6@2021.5.8_gem_Amoeba;
timeout=86400]); port=53976; primary=true; version=GEODE 1.14.0] ).
Duplicate durable clients are not allowed.
[2021-11-17 14:05:13,084Z][warn  tid=32115] CacheClientNotifier: Unsuccessfully registered client
with identifier
identity(TS02493(tpdemo2@x:loner):60759:9243242e:tpdemo2@x(version:GEODE
1.14.0),connection=1,durableAttributes=DurableClientAttributes[id=tpdemo2@TS02493@21.5.3@8.2.6@2021.5.8_gem_Amoeba;
timeout=86400]) and response code 64
[2021-11-17 14:05:13,098Z][warn  tid=290] Server connection from
[identity(TS02493(tpdemo2@x:loner):60759:9243242e:tpdemo2@x(version:GEODE
1.14.0),connection=1,durableAttributes=DurableClientAttributes[id=tpdemo2@TS02493@21.5.3@8.2.6@2021.5.8_gem_Amoeba;
timeout=86400]); port=60242]: connection disconnect detected by EOF.

I then counted the number of those errors and found that the
difference is exactly the same as the number of leaked socket handles,
so everything points to this area in the server code. But I tried to
reproduce this in the debugger and stepped through the geode code
without finding where the socket might be leaked. I put a breakpoint on
CacheClientNotifier.registerClientInternal and the line where that
error happens. But then the exception seems to bubble up to
AcceptorImpl.run, where there is a catch for IOException, and in case
of IOException the client socket is closed. Is this the right socket
I'm after?

Does this make it any clearer for someone who is more familiar with
the code? NOTE: the duplicate durable client happens sometimes due to
race conditions with reconnects (I think). It's not happening all the
time. It's not causing any problems to the client in general. But the
server leak eventually causes us to run out of file handles for the
process :(

Thank you!


On Tue, Nov 16, 2021 at 4:33 PM Leon Finker  wrote:
>
> Hi Darrel,
>
> Thank you! I'll try to track it!
>
> On Tue, Nov 16, 2021 at 2:42 PM Darrel Schneider  wrote:
> >
> > This link: 
> > https://help.mulesoft.com/s/article/How-to-identify-leaked-file-descriptors-that-are-shown-only-as-cant-identify-protocol-in-lsof
> >  points out that once the fd gets into this "can't identify protocol" state 
> > you can no longer figure out things like what port(s) and addresses are 
> > associated with the fd. They suggest running lsof periodically to catch 
> > that fd (in your example output the first fd is 133u) when it was still 
> > healthy. This would help track it down to an area of the geode code. For 
> > example you could see that it was one of the sockets using the port the 
> > cache server is listening on.

Re: Open socket handles build up over time (leaking?)

2021-11-17 Thread Leon Finker
Following Darrel's excellent advice :) I think I tracked down the area
of the socket handle leak. As the article suggested, I ran the lsof
capture every 5 minutes. I then traced back the cleaned up socket
handles to the valid lsof entries. I verified a bunch of them and they
all ended up pointing to the same durable connection cases that fail
as follows:

[2021-11-17 14:05:13,081Z][info  tid=32115] :Cache server: Initializing secondary server-to-client
communication socket: Socket[addr=/x.x.x.x,port=56090,localport=40011]
[2021-11-17 14:05:13,083Z][warn  tid=32115] The requested durable client has the same identifier (
tpdemo2@x_gem_x ) as an existing durable client (
CacheClientProxy[identity(TS02493(tpdemo2@x:loner):60759:9243242e:tpdemo2@x(version:GEODE
1.14.0),connection=1,durableAttributes=DurableClientAttributes[id=tpdemo2@TS02493@21.5.3@8.2.6@2021.5.8_gem_Amoeba;
timeout=86400]); port=53976; primary=true; version=GEODE 1.14.0] ).
Duplicate durable clients are not allowed.
[2021-11-17 14:05:13,084Z][warn  tid=32115] CacheClientNotifier: Unsuccessfully registered client
with identifier
identity(TS02493(tpdemo2@x:loner):60759:9243242e:tpdemo2@x(version:GEODE
1.14.0),connection=1,durableAttributes=DurableClientAttributes[id=tpdemo2@TS02493@21.5.3@8.2.6@2021.5.8_gem_Amoeba;
timeout=86400]) and response code 64
[2021-11-17 14:05:13,098Z][warn  tid=290] Server connection from
[identity(TS02493(tpdemo2@x:loner):60759:9243242e:tpdemo2@x(version:GEODE
1.14.0),connection=1,durableAttributes=DurableClientAttributes[id=tpdemo2@TS02493@21.5.3@8.2.6@2021.5.8_gem_Amoeba;
timeout=86400]); port=60242]: connection disconnect detected by EOF.

I then counted the number of those errors and found that the
difference is exactly the same as the number of leaked socket handles,
so everything points to this area in the server code. But I tried to
reproduce this in the debugger and stepped through the geode code
without finding where the socket might be leaked. I put a breakpoint on
CacheClientNotifier.registerClientInternal and the line where that
error happens. But then the exception seems to bubble up to
AcceptorImpl.run, where there is a catch for IOException, and in case
of IOException the client socket is closed. Is this the right socket
I'm after?

Does this make it any clearer for someone who is more familiar with
the code? NOTE: the duplicate durable client happens sometimes due to
race conditions with reconnects (I think). It's not happening all the
time. It's not causing any problems to the client in general. But the
server leak eventually causes us to run out of file handles for the
process :(

Thank you!


On Tue, Nov 16, 2021 at 4:33 PM Leon Finker  wrote:
>
> Hi Darrel,
>
> Thank you! I'll try to track it!
>
> On Tue, Nov 16, 2021 at 2:42 PM Darrel Schneider  wrote:
> >
> > This link: 
> > https://help.mulesoft.com/s/article/How-to-identify-leaked-file-descriptors-that-are-shown-only-as-cant-identify-protocol-in-lsof
> >  points out that once the fd gets into this "can't identify protocol" state 
> > you can no longer figure out things like what port(s) and addresses are 
> > associated with the fd. They suggest running lsof periodically to catch 
> > that fd (in your example output the first fd is 133u) when it was still 
> > healthy. This would help track it down to an area of the geode code. For 
> > example you could see that it was one of the sockets using the port the 
> > cache server is listening on.
> >
> > ________________
> > From: Leon Finker 
> > Sent: Tuesday, November 16, 2021 9:28 AM
> > To: dev@geode.apache.org 
> > Subject: Open socket handles build up over time (leaking?)
> >
> > Hi,
> >
> > We observe in our geode (1.14 - same before as well in 1.13) cache
> > server (that supports durable client sessions) an increase in half
> > opened sockets. It seems there is a socket leak. Could someone
> > recommend how to track the leak down? It's not obvious where it's
> > leaking... I can only suspect AcceptorImpl.run, where it only
> > handles IOException, but I wasn't able to reproduce it in the debugger
> > yet...
> >
> > lsof -p 344|grep "can't"
> >
> > java 344   

Re: Open socket handles build up over time (leaking?)

2021-11-16 Thread Leon Finker
Hi Darrel,

Thank you! I'll try to track it!

On Tue, Nov 16, 2021 at 2:42 PM Darrel Schneider  wrote:
>
> This link: 
> https://help.mulesoft.com/s/article/How-to-identify-leaked-file-descriptors-that-are-shown-only-as-cant-identify-protocol-in-lsof
>  points out that once the fd gets into this "can't identify protocol" state 
> you can no longer figure out things like what port(s) and addresses are 
> associated with the fd. They suggest running lsof periodically to catch that 
> fd (in your example output the first fd is 133u) when it was still healthy. 
> This would help track it down to an area of the geode code. For example you 
> could see that it was one of the sockets using the port the cache server is 
> listening on.
>
> 
> From: Leon Finker 
> Sent: Tuesday, November 16, 2021 9:28 AM
> To: dev@geode.apache.org 
> Subject: Open socket handles build up over time (leaking?)
>
> Hi,
>
> We observe in our geode (1.14 - same before as well in 1.13) cache
> server (that supports durable client sessions) an increase in half
> opened sockets. It seems there is a socket leak. Could someone
> recommend how to track the leak down? It's not obvious where it's
> leaking... I can only suspect AcceptorImpl.run, where it only
> handles IOException, but I wasn't able to reproduce it in the debugger
> yet...
>
> lsof -p 344|grep "can't"
>
> java 344   user  133u  sock0,6   0t0 115956017
> can't identify protocol
> java 344   user  142u  sock0,6   0t0 113361870
> can't identify protocol
> java 344   user  143u  sock0,6   0t0 111979650
> can't identify protocol
> java 344   user  156u  sock0,6   0t0 117202529
> can't identify protocol
> java 344   user  178u  sock0,6   0t0 113357568
> can't identify protocol
> ...
>
> lsof -p 344|grep "can't"|wc -l
> 934
>
> Thank you


Re: Open socket handles build up over time (leaking?)

2021-11-16 Thread Leon Finker
Hi Anthony,

>- What OS for client and server?
Server: CentOS release 6.10
Client: Windows 10 and CentOS 6.*

The cache is connected to by durable and non-durable clients. Durable
clients connect from Windows 10, and non-durable clients connect from
backend CentOS servers. The server cache has no custom networking
code/libraries; it's purely a geode cache.

>- What is the scenario?  Is it “normal” operation or is the client or
>server killed?

The server is not killed and runs continuously. Clients are also
normally connected, but they may lose connections and reconnect. The
issue is definitely caused by client-side connections to the server,
but I'm not able to isolate any specific exception from the geode log.

>- Does netstat give you any additional information about the sockets?  Are any 
>in TIME_WAIT status?

I ran "netstat -altnup" and all TCP connections for the server process
show only LISTEN and ESTABLISHED states.

>- Do you have a reproducible test case?

Unfortunately, no. I'm trying to isolate what could be causing the
issue from the geode logs.

>- Do you have a tcpdump of the socket?

No. I'll try it, but I'm not sure it will help. What should I look
for? Unfortunately I'm not sure which connection causes the problem.

>- Are you seeing the sockets clean up over time or do they persist until a 
>reboot?
They seem to persist until a JVM restart. Forcing GC has no effect on
their count; it slowly creeps up.

On Tue, Nov 16, 2021 at 12:44 PM Anthony Baker  wrote:
>
> Hi, thanks for this report.  Some questions to help us help you—
>
> - What OS for client and server?
> - Are you seeing the sockets clean up over time or do they persist until a 
> reboot?
> - Does netstat give you any additional information about the sockets?  Are 
> any in TIME_WAIT status?
> - Do you have a tcpdump of the socket?
> - What is the scenario?  Is it “normal” operation or is the client or server 
> killed?
> - Do you have a reproducible test case?
>
> Thanks,
> Anthony
>
>
>
>
> > On Nov 16, 2021, at 9:28 AM, Leon Finker  wrote:
> >
> > Hi,
> >
> > We observe in our geode (1.14 - same before as well in 1.13) cache
> > server (that supports durable client sessions) an increase in half
> > opened sockets. It seems there is a socket leak. Could someone
> > recommend how to track the leak down? It's not obvious where it's
> > leaking... I can only suspect AcceptorImpl.run, where it only
> > handles IOException, but I wasn't able to reproduce it in the debugger
> > yet...
> >
> > lsof -p 344|grep "can't"
> >
> > java 344   user  133u  sock0,6   0t0 115956017
> > can't identify protocol
> > java 344   user  142u  sock0,6   0t0 113361870
> > can't identify protocol
> > java 344   user  143u  sock0,6   0t0 111979650
> > can't identify protocol
> > java 344   user  156u  sock0,6   0t0 117202529
> > can't identify protocol
> > java 344   user  178u  sock0,6   0t0 113357568
> > can't identify protocol
> > ...
> >
> > lsof -p 344|grep "can't"|wc -l
> > 934
> >
> > Thank you
>


Re: Open socket handles build up over time (leaking?)

2021-11-16 Thread Darrel Schneider
This link: 
https://help.mulesoft.com/s/article/How-to-identify-leaked-file-descriptors-that-are-shown-only-as-cant-identify-protocol-in-lsof
 points out that once the fd gets into this "can't identify protocol" state you 
can no longer figure out things like what port(s) and addresses are associated 
with the fd. They suggest running lsof periodically to catch that fd (in your 
example output the first fd is 133u) when it was still healthy. This would help 
track it down to an area of the geode code. For example you could see that it 
was one of the sockets using the port the cache server is listening on.
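
Something like this small utility could take those periodic snapshots
(just a sketch; the pid, output directory and 5 minute interval are
example values, and the lsof invocation is the same one shown earlier
in this thread). A leaked fd that later shows up as "can't identify
protocol" can then be traced back to an earlier snapshot where the same
fd number still had a port and peer address:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

class LsofSnapshots {
  public static void main(String[] args) throws IOException, InterruptedException {
    String pid = args.length > 0 ? args[0] : "344";   // cache server pid from the lsof output above
    Path dir = Paths.get("/tmp/lsof-snapshots");      // assumed output directory
    Files.createDirectories(dir);
    DateTimeFormatter fmt = DateTimeFormatter.ofPattern("yyyyMMdd-HHmmss");
    while (true) {
      Path out = dir.resolve("lsof-" + LocalDateTime.now().format(fmt) + ".txt");
      // Run "lsof -p <pid>" and write the full listing to a timestamped file.
      new ProcessBuilder("lsof", "-p", pid)
          .redirectErrorStream(true)
          .redirectOutput(out.toFile())
          .start()
          .waitFor();
      Thread.sleep(5 * 60 * 1000);                    // capture every 5 minutes
    }
  }
}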


From: Leon Finker 
Sent: Tuesday, November 16, 2021 9:28 AM
To: dev@geode.apache.org 
Subject: Open socket handles build up over time (leaking?)

Hi,

We observe in our geode (1.14 - same before as well in 1.13) cache
server (that supports durable client sessions) an increase in half
opened sockets. It seems there is a socket leak. Could someone
recommend how to track the leak down? It's not obvious where it's
leaking... I can only suspect AcceptorImpl.run, where it only
handles IOException, but I wasn't able to reproduce it in the debugger
yet...

lsof -p 344|grep "can't"

java 344   user  133u  sock0,6   0t0 115956017
can't identify protocol
java 344   user  142u  sock0,6   0t0 113361870
can't identify protocol
java 344   user  143u  sock0,6   0t0 111979650
can't identify protocol
java 344   user  156u  sock0,6   0t0 117202529
can't identify protocol
java 344   user  178u  sock0,6   0t0 113357568
can't identify protocol
...

lsof -p 344|grep "can't"|wc -l
934

Thank you


Re: Open socket handles build up over time (leaking?)

2021-11-16 Thread Anthony Baker
Hi, thanks for this report.  Some questions to help us help you—

- What OS for client and server?
- Are you seeing the sockets clean up over time or do they persist until a 
reboot?
- Does netstat give you any additional information about the sockets?  Are any 
in TIME_WAIT status?
- Do you have a tcpdump of the socket? 
- What is the scenario?  Is it “normal” operation or is the client or server 
killed?
- Do you have a reproducible test case?

Thanks,
Anthony




> On Nov 16, 2021, at 9:28 AM, Leon Finker  wrote:
> 
> Hi,
> 
> We observe in our geode (1.14 - same before as well in 1.13) cache
> server (that supports durable client sessions) an increase in half
> opened sockets. It seems there is a socket leak. Could someone
> recommend how to track the leak down? It's not obvious where it's
> leaking... I can only suspect AcceptorImpl.run, where it only
> handles IOException, but I wasn't able to reproduce it in the debugger
> yet...
> 
> lsof -p 344|grep "can't"
> 
> java 344   user  133u  sock0,6   0t0 115956017
> can't identify protocol
> java 344   user  142u  sock0,6   0t0 113361870
> can't identify protocol
> java 344   user  143u  sock0,6   0t0 111979650
> can't identify protocol
> java 344   user  156u  sock0,6   0t0 117202529
> can't identify protocol
> java 344   user  178u  sock0,6   0t0 113357568
> can't identify protocol
> ...
> 
> lsof -p 344|grep "can't"|wc -l
> 934
> 
> Thank you



Open socket handles build up over time (leaking?)

2021-11-16 Thread Leon Finker
Hi,

We observe in our geode (1.14, and the same before in 1.13) cache
server (which supports durable client sessions) an increase in
half-opened sockets. It seems there is a socket leak. Could someone
recommend how to track the leak down? It's not obvious where it's
leaking... I can only suspect AcceptorImpl.run, where it only
handles IOException, but I wasn't able to reproduce it in the debugger
yet...

lsof -p 344|grep "can't"

java 344   user  133u  sock0,6   0t0 115956017
can't identify protocol
java 344   user  142u  sock0,6   0t0 113361870
can't identify protocol
java 344   user  143u  sock0,6   0t0 111979650
can't identify protocol
java 344   user  156u  sock0,6   0t0 117202529
can't identify protocol
java 344   user  178u  sock0,6   0t0 113357568
can't identify protocol
...

lsof -p 344|grep "can't"|wc -l
934

Thank you