I think you found the leak! My understanding of the code in registerClientInternal (I'm looking at the current develop branch) is that when it logs the warning "Duplicate durable clients are not allowed", it considers the current client connect attempt to have failed. It writes this response back to the client: REPLY_EXCEPTION_DUPLICATE_DURABLE_CLIENT, which causes the client to throw ServerRefusedConnectionException. What seems wrong about this method is that even though it sets "unsuccessfulMsg" and correctly sends back a handshake saying the client is rejected, it neither throws an exception nor closes "socket". I think right before it calls performPostAuthorization it should guard on "unsuccessfulMsg" and close the socket, as in the sketch below.
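Roughly this (sketched from my reading of develop; the exact surrounding code in CacheClientNotifier.registerClientInternal may differ):

    if (unsuccessfulMsg != null) {
      // The rejection handshake (REPLY_EXCEPTION_DUPLICATE_DURABLE_CLIENT) has
      // already been written back to the client at this point, so the server
      // is done with this socket; closing it releases the file descriptor
      // that otherwise leaks.
      try {
        socket.close();
      } catch (IOException ignore) {
        // We are abandoning the connection anyway; nothing useful to do here.
      }
    } else {
      performPostAuthorization(...); // success path, unchanged
    }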
________________________________
From: Leon Finker <leon...@gmail.com>
Sent: Wednesday, November 17, 2021 10:30 AM
To: dev@geode.apache.org <dev@geode.apache.org>
Subject: Re: Open socket handles build up over time (leaking?)

Following Darrel's excellent advice :) I think I tracked down the area of the socket handle leak. As the article suggested, I ran the lsof capture every 5 minutes, then traced the socket handles that later turned into "can't identify protocol" entries back to their earlier, still-valid lsof entries. I verified a bunch of them, and they all ended up pointing to the same durable-connection cases that fail as follows:

[2021-11-17 14:05:13,081Z][info <Client Queue Initialization Thread 17> tid=32115] :Cache server: Initializing secondary server-to-client communication socket: Socket[addr=/x.x.x.x,port=56090,localport=40011]

[2021-11-17 14:05:13,083Z][warn <Client Queue Initialization Thread 17> tid=32115] The requested durable client has the same identifier ( tpdemo2@x_gem_x ) as an existing durable client ( CacheClientProxy[identity(TS02493(tpdemo2@x:loner):60759:9243242e:tpdemo2@x(version:GEODE 1.14.0),connection=1,durableAttributes=DurableClientAttributes[id=tpdemo2@TS02493@21.5.3@8.2.6@2021.5.8_gem_Amoeba; timeout=86400]); port=53976; primary=true; version=GEODE 1.14.0] ). Duplicate durable clients are not allowed.

[2021-11-17 14:05:13,084Z][warn <Client Queue Initialization Thread 17> tid=32115] CacheClientNotifier: Unsuccessfully registered client with identifier identity(TS02493(tpdemo2@x:loner):60759:9243242e:tpdemo2@x(version:GEODE 1.14.0),connection=1,durableAttributes=DurableClientAttributes[id=tpdemo2@TS02493@21.5.3@8.2.6@2021.5.8_gem_Amoeba; timeout=86400]) and response code 64

[2021-11-17 14:05:13,098Z][warn <ServerConnection on port 40011 Thread 34> tid=290] Server connection from [identity(TS02493(tpdemo2@x:loner):60759:9243242e:tpdemo2@x(version:GEODE 1.14.0),connection=1,durableAttributes=DurableClientAttributes[id=tpdemo2@TS02493@21.5.3@8.2.6@2021.5.8_gem_Amoeba; timeout=86400]); port=60242]: connection disconnect detected by EOF.

I then counted the number of those errors and found that it matches the number of leaked socket handles exactly. So everything points to this area of the server code. But when I tried to reproduce this in the debugger and stepped through the Geode code, I couldn't find where the socket is leaked. I put a breakpoint on CacheClientNotifier.registerClientInternal and on the line where that error is logged. But then again, the exception seems to bubble up to AcceptorImpl.run, where there is a catch for IOException, and in the case of IOException the client socket is closed. Is this the right socket I'm after? Does this make it any clearer for someone who is more familiar with the code?

NOTE: the duplicate durable client happens sometimes, due to race conditions with reconnects (I think). It's not happening all the time, and it's not causing any problems for the client in general. But the server-side leak eventually causes us to run out of file handles for the process :(

Thank you!
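For anyone trying to reproduce this outside Geode, the suspected pattern boils down to something like the standalone sketch below — hypothetical code, not from the Geode source; the port and the response code 64 are borrowed from the logs above. A server-side socket that is handed a refusal and then abandoned without close() keeps its fd open after the client disconnects, with no live connection behind it — the kind of entry lsof can only describe as "can't identify protocol".

    import java.io.IOException;
    import java.io.OutputStream;
    import java.net.ServerSocket;
    import java.net.Socket;
    import java.util.ArrayList;
    import java.util.List;

    // Standalone illustration of the suspected server-side pattern: write a
    // refusal back to the client, then drop the Socket without closing it.
    public class RefusedClientLeak {

      // Hold the references so the fds can't be reclaimed by finalization; in
      // the real server something similar happens if the Socket is simply
      // forgotten on the unsuccessful-registration path.
      private static final List<Socket> abandoned = new ArrayList<>();

      public static void main(String[] args) throws IOException {
        try (ServerSocket server = new ServerSocket(40011)) { // port from the logs
          while (true) {
            Socket client = server.accept();
            OutputStream out = client.getOutputStream();
            out.write(64); // stand-in for the handshake "response code 64"
            out.flush();
            // The client reads the refusal and disconnects ("connection
            // disconnect detected by EOF"), but this side never calls
            // client.close(), so the fd is leaked.
            abandoned.add(client);
          }
        }
      }
    }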
On Tue, Nov 16, 2021 at 4:33 PM Leon Finker <leon...@gmail.com> wrote:
>
> Hi Darrel,
>
> Thank you! I'll try to track it!
>
> On Tue, Nov 16, 2021 at 2:42 PM Darrel Schneider <dar...@vmware.com> wrote:
> >
> > This link:
> > https://help.mulesoft.com/s/article/How-to-identify-leaked-file-descriptors-that-are-shown-only-as-cant-identify-protocol-in-lsof
> > points out that once the fd gets into this "can't identify protocol" state
> > you can no longer figure out things like what port(s) and addresses are
> > associated with the fd. They suggest running lsof periodically to catch
> > that fd (in your example output the first fd is 133u) when it was still
> > healthy. This would help track it down to an area of the Geode code. For
> > example, you could see that it was one of the sockets using the port the
> > cache server is listening on.
> >
> > ________________________________
> > From: Leon Finker <leon...@gmail.com>
> > Sent: Tuesday, November 16, 2021 9:28 AM
> > To: dev@geode.apache.org <dev@geode.apache.org>
> > Subject: Open socket handles build up over time (leaking?)
> >
> > Hi,
> >
> > We observe in our Geode cache server (1.14, and the same in 1.13 before
> > it), which supports durable client sessions, an increase in half-open
> > sockets over time. It seems there is a socket leak. Could someone
> > recommend how to track the leak down? It's not obvious where it's
> > leaking... I can only suspect AcceptorImpl.run, where it only handles
> > IOException, but I wasn't able to reproduce it in the debugger yet...
> >
> > lsof -p 344 | grep "can't"
> >
> > java 344 user 133u sock 0,6 0t0 115956017 can't identify protocol
> > java 344 user 142u sock 0,6 0t0 113361870 can't identify protocol
> > java 344 user 143u sock 0,6 0t0 111979650 can't identify protocol
> > java 344 user 156u sock 0,6 0t0 117202529 can't identify protocol
> > java 344 user 178u sock 0,6 0t0 113357568 can't identify protocol
> > ...
> >
> > lsof -p 344 | grep "can't" | wc -l
> > 934
> >
> > Thank you
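A small JVM-side complement to the periodic lsof capture, assuming a HotSpot JDK on a Unix-like OS: the com.sun.management platform MXBean can report the process's own open-fd count, which makes it easy to log alongside server events (e.g. each refused durable-client registration) and correlate any jumps.

    import java.lang.management.ManagementFactory;
    import com.sun.management.UnixOperatingSystemMXBean;

    // Prints the process's open file-descriptor count and limit using the
    // HotSpot platform MXBean (available on Unix-like systems).
    public class FdWatch {
      public static void main(String[] args) {
        UnixOperatingSystemMXBean os =
            (UnixOperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();
        System.out.println("open fds: " + os.getOpenFileDescriptorCount()
            + " / max: " + os.getMaxFileDescriptorCount());
      }
    }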