Re: Open socket handles build up over time (leaking?)

2021-12-03 Thread Leon Finker
Hi,

Unfortunately I'm not able to find the cause of the remaining minor
socket handle leak. I've tried tracing it for a week. It won't cause us
any major issues since it leaks about 10 handles per day on average.
For example, after 15 days, there are only 160 socket handles leaked
on one cache peer and 120 on the other. It is definitely the cache
server port and durable client connections; only those show up in the
lsof captures as problematic.

I captured lsof output and enabled debug logging in Geode to be able
to correlate ports and maybe get some additional information for those
connections, but there is nothing in the log for them. I've correlated
about 10 of these CLOSE_WAIT and then abandoned sockets, and in all
cases the very last thing Geode logs at debug level for those ports is
the following:

[2021-12-03 10:20:01,056Z][debug  tid=74] Checking whether to
ping 
CacheClientProxy[identity(192.168.66.7(xxx:loner):65188:000fae7f:xxx(version:GEODE
1.14.0),connection=1,durableAttributes=DurableClientAttributes[id=xxx_gem_Amoeba;
timeout=86400]); port=59084; primary=true; version=GEODE 1.14.0]

[2021-12-03 10:20:01,056Z][debug  tid=74] Not pinging because
not idle: 
CacheClientProxy[identity(192.168.66.7(xxx:loner):65188:000fae7f:xxx(version:GEODE
1.14.0),connection=1,durableAttributes=DurableClientAttributes[id=xx_gem_Amoeba;
timeout=86400]); port=59084; primary=true; version=GEODE 1.14.0]

(I replaced durable id details with xxx above). But after the above
lines, there is no longer any mention of port=59084 or :59084. Nothing
after hours or days. I can see the same user/durable id reconnect
after some time, but on a different port. The port that is lost has no
activity in the logs ever again.
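
For context, here is a minimal sketch of the kind of durable-client
setup involved (the durable id and timeout mirror the
DurableClientAttributes in the log above; the host, port, and class
name are placeholders, not our actual client code):

import org.apache.geode.cache.client.ClientCache;
import org.apache.geode.cache.client.ClientCacheFactory;

public class DurableClientSketch {
  public static void main(String[] args) {
    // Durable id/timeout match the DurableClientAttributes seen in the
    // debug log (timeout=86400 seconds); host and port are placeholders.
    ClientCache cache = new ClientCacheFactory()
        .set("durable-client-id", "xxx_gem_Amoeba")
        .set("durable-client-timeout", "86400")
        .addPoolServer("cache-server.example.com", 40404)
        .setPoolSubscriptionEnabled(true)
        .create();
    // Tell the server this durable client is ready to receive queued events;
    // on reconnect the server resumes the existing CacheClientProxy.
    cache.readyForEvents();
  }
}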

I don't know how to debug it further. The other leak was easy compared
to this one, since there was an exception in the log and some activity
for the port, so it was easy to correlate.

But just to confirm: the fix for the other leak is working perfectly
fine. After 2 weeks, all good!

Thank you very much!

On Mon, Nov 22, 2021 at 1:45 PM Darrel Schneider  wrote:
>
> Thanks for filing the ticket and for looking into the additional leak
> 
> From: Leon Finker 
> Sent: Monday, November 22, 2021 10:19 AM
> To: dev@geode.apache.org 
> Subject: Re: Open socket handles build up over time (leaking?)
>
> Hi,
>
> Bug ticket: 
> https://issues.apache.org/jira/browse/GEODE-9819
>
> I applied the fix locally and now this leak is fixed. Thank you!
>
> There is still a minor leak of about 10 handles per day that I want to
> track down next week. From the looks of it, it's not as simple to track
> as this one :( Still happens in the durable client connection use case
> and cache server port. Seems to be when an existing durable connection
> is disconnected and automatically reconnected. I need to do more
> testing first to know for sure.
>
> On Wed, Nov 17, 2021 at 4:22 PM Darrel Schneider  wrote:
> >
> > Yes, open a geode bug ticket
> > 
> > From: Leon Finker 
> > Sent: Wednesday, November 17, 2021 1:13 PM
> > To: dev@geode.apache.org 
> > Subject: Re: Open socket handles build up over time (leaking?)
> >
> > Thank you! I'll try to test this change. What's the procedure for
> > this? Should I open a geode bug ticket?
> >
> > On Wed, Nov 17, 2021 at 2:22 PM Darrel Schneider  wrote:
> > >
> > > I think you found the leak!
> > > My understanding of the code in registerClientInternal (I'm looking at 
> > > the current develop branch) is that when it logs the warning "Duplicate 
> > > durable clients are not allowed" that it considers the current client 
> > > connect attempt to have failed. It writes this response back to it: 
> > > REPLY_EXCEPTION_DUPLICATE_DURABLE_CLIENT. This will cause the client to 
> > > throw ServerRefusedConnectionException. What seems wrong about this 
> > > method is that even though it sets "unsuccessfulMsg" and correctly sends 
> > > back a handshake saying the client is rejected, it does not throw an 
> > > exception and it does not close "socket". I think right before it calls 
> > > performPostAuthorization it should do the following:
> > > if (unsuccessfulMsg != null) {
> > >   // The rejection handshake has already been sent to the client;
> > >   // close the server-side socket so it doesn't linger in CLOSE_WAIT.
> > >   try {
> > >     socket.close();
> > >   } catch (IOException ignore) {
> > >   }
> > > } else {
> > >   performPostAuthorization(...);
> > > }
> > > 
> > > From: Leon Finker 
> > > Sent: Wednesday, November 17, 2021 10:30 AM
> > > To: dev@geode.apache.org 
> > > Subject: Re: Open socket handles build up over time (leaking?)
> > >
> > > Following Darrel's excellent advice :) I think I 

Re: Question related to orphaned .drf files in disk-store

2021-12-03 Thread Anthony Baker
Got it, that seems entirely reasonable.

Anthony


> On Dec 3, 2021, at 7:37 AM, Jakov Varenina  wrote:
> 
> Hi Anthony,
> 
> Not sure what's normal, but at the moment we were investigating the issue 
> there were 21 .crf files in the disk-store (on one server) with the default 
> max-oplog-size (1GB) and compaction-threshold (50%).
> 
> BRs/Jakov
> 
> On 02. 12. 2021. 17:06, Anthony Baker wrote:
>> Related but different question:  how many active oplogs do you normally see 
>> at one time?  You may want to adjust the max-oplog-size if the default of 1 
>> GB is too small.
>> 



Re: Question related to orphaned .drf files in disk-store

2021-12-03 Thread Jakov Varenina

Hi Anthony,

Not sure what's normal, but at the moment we were investigating the 
issue there were 21 .crf files in the disk-store (on one server) with 
the default max-oplog-size (1GB) and compaction-threshold (50%).
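
In case it's useful, a minimal sketch (assuming the standard
DiskStoreFactory API; the store name and the 4 GB value are just
examples, not our actual configuration) of how those two settings can
be changed programmatically:

import org.apache.geode.cache.Cache;
import org.apache.geode.cache.CacheFactory;
import org.apache.geode.cache.DiskStore;

public class DiskStoreTuningSketch {
  public static void main(String[] args) {
    Cache cache = new CacheFactory().create();
    // Defaults discussed here: max-oplog-size 1 GB, compaction-threshold 50%.
    DiskStore store = cache.createDiskStoreFactory()
        .setMaxOplogSize(4096)        // in MB; larger oplogs mean fewer files
        .setCompactionThreshold(50)   // % of garbage before an oplog is compacted
        .setAutoCompact(true)
        .create("exampleDiskStore");  // placeholder name
  }
}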


BRs/Jakov

On 02. 12. 2021. 17:06, Anthony Baker wrote:

Related but different question:  how many active oplogs do you normally see at 
one time?  You may want to adjust the max-oplog-size if the default of 1 GB is 
too small.

On Dec 2, 2021, at 1:11 AM, Jakov Varenina wrote:

Hi Dan,

We forgot to mention that we actually configure off-heap for the regions, so 
cache entry values are stored outside the heap memory. Only Oplog objects that 
are not compacted and that have a .crf file reference live entries from the 
cache; these Oplog objects are not stored in the onlyDrfOplogs hashmap. The 
onlyDrfOplogs map holds only Oplog objects representing orphaned .drf files 
(the ones without an accompanying .crf or .krf file). These objects have been 
compacted and don't contain a reference to any live entry from the cache. All 
of this 18 GB is actually occupied by empty pendingKrfTags hashmaps.

In this case there are 7680 Oplog objects stored in onlyDrfOplogs. Every Oplog 
object has its own regionMap hashmap, and every regionMap can contain hundreds 
of empty pendingKrfTags hashmaps. When you bring all that together you get more 
than 18 GB of unnecessary heap memory.
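
As a plain-Java illustration of the mechanism (this is not Geode code, just a 
sketch of why a logically empty HashMap can still pin a lot of memory):

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class EmptyMapRetentionSketch {
  public static void main(String[] args) {
    Map<String, Object> pendingKrfTags = new HashMap<>();
    for (int i = 0; i < 10_000; i++) {
      pendingKrfTags.put("entry-" + i, new Object());
    }
    // clear() (or removing every entry) empties the map logically, but
    // java.util.HashMap never shrinks its internal table array, so the
    // thousands of reference slots stay allocated for as long as the map
    // itself stays reachable, e.g. from an Oplog's regionMap.
    pendingKrfTags.clear();
    // Dropping the reference (or swapping in a shared empty map) is what
    // actually lets the oversized backing table be garbage collected.
    pendingKrfTags = Collections.emptyMap();
  }
}

This matches what the heap dump shows: the pendingKrfTags maps report size 0 
but still hold large backing tables.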

Thank you all for quick review of PR and fast response to our questions!

BRs/Jakov


On 02. 12. 2021. 00:25, Dan Smith wrote:
Interesting. It does look like that pendingKrfTags structure is wasting memory.

I think that retained heap of 20 gigabytes might include your entire cache, 
because those objects have references back to the Cache object. However with 6K 
oplogs each having an empty map with 4K elements that does add up.

-Dan

From: Jakov Varenina
Sent: Tuesday, November 30, 2021 5:53 AM
To: dev@geode.apache.org
Subject: Re: Question related to orphaned .drf files in disk-store

Hi Dan and all,


Just to provide an additional picture that better represents the severity of 
the problem with pendingKrfTags. After you check the second picture in the mail 
below, please come back and check this one as well. Here you can see that 
pendingKrfTags is empty yet has a capacity of 9,192 allocated in memory.



Sorry for any inconvenience.

BRs/Jakov


On 30. 11. 2021. 09:32, Jakov Varenina wrote:

Hi Dan,


Thank you for your answer!


We have identified a memory leak in the Oplog objects that represent orphaned 
.drf files in heap memory. In the screenshot below you can see that 7680 
onlyDrfOplogs consume more than 18 GB of heap, which doesn't seem correct.



In the picture below you can see that the 
drfOnlyOplogs.Oplog.regionMap.pendingKrfTags structure is responsible for more 
than 95% of the drfOnlyOplogs heap memory.
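
A rough back-of-envelope from these numbers: 18 GB spread across 7680 Oplog 
objects is about 2.4 MB retained per Oplog. Spread further over the (up to 
hundreds of) empty pendingKrfTags maps in each regionMap, that is only tens of 
KB per map, which is roughly what an empty HashMap whose table has grown to 
several thousand slots keeps allocated.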




The pendingKrfTags structure is actually empty (although it still consumes 
memory, because it was populated previously and the capacity of the HashMap was 
never reduced) and is not used by the onlyDrfOplogs objects. Additionally, the 
regionMap.liveEntries linked list has just one element (a fake disk entry, 
OplogDiskEntry, indicating that the list is empty) and it is also not used. You 
can find more details on why the pendingKrfTags structure remained in memory 
for Oplog objects representing orphaned .drf files, and a possible solution, in 
the following ticket and PR:


https://issues.apache.org/jira/browse/GEODE-9854 


https://github.com/apache/geode/pull/7145



BRs/Jakov



On 24. 11. 2021. 23:12, Dan Smith wrote:
The .drf file contains destroy records for entries in any