Hello everyone,

We successfully deployed CAS v5.2.3 to production a couple of days ago.

Our configuration is: two active/passive CAS nodes with an in-memory Hazelcast 
cluster (same JVM as CAS) that replicates the tickets.

Everything worked fine for the first two hours, but when the connections ramped 
up, the active node froze. We realized that the heap (2g max) was full, so we 
stopped both nodes to bump -Xmx up to 6g on each node.

After that CAS worked perfectly.
When monitoring the heap through the day, we noticed a very steep curve going 
from 1g around 9am to a max of 5.5g around 11am. Then the curve flattened and 
stayed around 5.5g until 8pm. After that the heap went down to around 4g.

During the 11am - 8pm period, several things happened:

- major GC time increased up to 3s, degrading the response time of the 
applications that use CAS. We suspect this is related to cache eviction; the 
frequency was around one major GC every 30 min.

- some users were disconnected without notice during the afternoon (or had 
issues being granted PTs), obviously a consequence of the cache hitting its max 
allowed size and aggressively evicting tickets.

We suspected an eviction problem with Hazelcast, so we took a heap dump and 
installed Hazelcast Management Center.

Our first observations were:

- we had a backup count set to 1, which doubled the size of the cluster.
- we had a huge number of PGTs: around 200000 for 3000 TGTs
- PGTs are quite big: >10k each (according to Hazelcast Management Center)
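
As a rough sanity check on these observations, here is a minimal Python sketch 
of the arithmetic. The ticket counts are the ones quoted above; reading ">10k" 
as at least 10 KiB per PGT is my assumption, and `estimate_bytes` is just a 
hypothetical helper, so the result is a lower bound rather than a measurement.

```python
# Back-of-the-envelope estimate of the memory held by PGTs in the cluster.
# Counts come from the post; the per-ticket size is an ASSUMED lower bound.

PGT_COUNT = 200_000
TGT_COUNT = 3_000
PGT_SIZE_BYTES = 10 * 1024  # assumption: ">10k" read as at least 10 KiB


def estimate_bytes(entries: int, entry_size: int, backup_count: int) -> int:
    """Cluster-wide bytes: one primary copy plus `backup_count` backup copies."""
    return entries * entry_size * (1 + backup_count)


pgts_per_tgt = PGT_COUNT / TGT_COUNT
no_backup = estimate_bytes(PGT_COUNT, PGT_SIZE_BYTES, backup_count=0)
one_backup = estimate_bytes(PGT_COUNT, PGT_SIZE_BYTES, backup_count=1)

print(f"PGTs per TGT: {pgts_per_tgt:.0f}")
print(f"PGTs, no backup: {no_backup / 1024**3:.2f} GiB")
print(f"PGTs, 1 backup:  {one_backup / 1024**3:.2f} GiB")
```

Under that assumption the PGTs alone account for roughly 2 GiB across the 
cluster, and about twice that with one backup copy, which is in the same range 
as the heap figures we observed.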

So for the next day we disabled the Hazelcast backup.

Now our heap usage is a little better.
The heap starts around 1g at 9am and plateaus at 5.5g around 12pm. From 12pm to 
4pm the curve stays flat around 5.5g with only minor GCs. Around 4pm a major GC 
occurs every 30 min until 6pm, then the heap goes down.

Our tickets are supposed to expire after 6h. So, the way I read this is: people 
start working around 9am and produce a lot of tickets between 9am and 12pm, 
hence the steep curve. Between 12pm and 2pm the activity slows down and ticket 
production stops, while the tickets created around 8am start to be evicted 
slowly. After 2pm activity starts again and tickets are created. Around 4pm the 
cache is full and massively evicts the tickets created in the morning, hence 
the major GCs.
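
To make that reading concrete, here is a toy Python model of it. It assumes a 
hard 6h expiry and nothing else removing tickets; the hourly creation figures 
are made up purely to mimic a busy morning, a quiet lunch break and a busy 
afternoon, and are not measurements from our system.

```python
# Toy model: with a fixed TTL and no other eviction, the tickets alive at a
# given hour are those created during the last TTL hours.
# The hourly creation figures below are HYPOTHETICAL illustration values.

TTL_HOURS = 6

created_per_hour = {
    8: 10_000, 9: 40_000, 10: 40_000, 11: 40_000,
    12: 5_000, 13: 5_000, 14: 30_000, 15: 30_000,
    16: 30_000, 17: 20_000, 18: 5_000, 19: 0,
}


def live_tickets(hour: int, ttl: int = TTL_HOURS) -> int:
    """Count tickets created in (hour - ttl, hour], i.e. not yet expired."""
    return sum(n for h, n in created_per_hour.items() if hour - ttl < h <= hour)


for hour in sorted(created_per_hour):
    print(f"{hour:02d}h  live={live_tickets(hour):>7}")
```

With inputs of this shape the live count climbs all morning, stays high through 
the afternoon because morning tickets expire only as fast as new ones arrive, 
and drops in the evening. If that is what is happening, a long plateau would be 
expected from the TTL alone and would not by itself indicate a leak.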

No users complained about being disconnected, but the heap stays close to its 
max for a good part of the day, and we still have around 200000 PGTs for 3000 
TGTs. We also have around 350 threads running all day.

Our configuration is:
- Xmx: 6g
- Ticket expiration policy: default, with TTL 6h and TTK 6h for TGTs (and PGTs)
- Hazelcast eviction policy: LFU
- Hazelcast max heap size: 70%
- GC: G1, Java 8
- CAS WAR overlay with Undertow
- A dozen webapps using 60+ webservices, all protected by CAS


For now it works, but we have to restart the nodes every night to clean the 
heap. 
I don't like the idea of the heap being 90% full all day: if the number of 
connections increases we might have unwanted disconnections again. The thread 
count is a concern as well, and I would like to do something about these 
issues.

My questions:

- are these numbers normal?
  - 200000 PGTs for 3000 TGTs
  - 3g of PGTs?
  - 350 threads all day?
  - 90% of the heap full all day?
- is our eviction policy correct?

I can't decide whether we have a memory leak or whether this is a normal 
situation considering our 3000 users and our 70+ applications linked by CAS.
We would feel more comfortable if the heap weren't at 90% all day.

We have several options now:

- try LRU instead of LFU
- reduce the TGT TTL to 4h
- use a different eviction policy, like a timeout on the tickets
- bump up the Xmx, hoping we would hit the sweet spot between memory consumption 
and cache eviction, but taking the risk of lengthy major GCs
- put the Hazelcast cluster in its own JVM
- do nothing because everything is normal ...


I know it's a long text, so thank you for reading everything! Any advice will 
be appreciated!
