Re: [cas-user] Rapid Memory Consumption and Interpreting Heap Dump

David A. Kovacic Fri, 12 Dec 2014 07:11:43 -0800

Let me also expand on this a bit:

In the normal course of events seeing 2-3K/hr worth of STs being created
is not unusual for us.  Of those typically half (roughly 1500) are
Google logins, the key difference being that they are all different
users with very few repeats.  During one of those "normal" hours even
with that number of Google logins the heap usage stays fairly stable at
about 500MB used out of 1000MB allocated.  A typical pattern looks like:

[graph of heap usage over time; image not preserved in this archive]

The graph shows objects entering and leaving the heap on a very
regular basis.

When the "issue" triggers, the used heap appears to grow without ever
dropping again (at least until the heap runs out of memory and the
service needs to be restarted).  There MAY be some GC process that would
clean up the heap, but it is not running in a timely enough fashion to
prevent the heap memory from being exhausted, at least if the "issue"
generated sufficient logins to Google.
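
To make the difference between the two patterns concrete, here is a
rough sketch (not something we run in production) of how one could flag
the "never drops again" case from a series of used-heap samples. The
function name, the window size, and the 20MB drop threshold are all
made up for illustration; the samples themselves could come from
whatever you already collect (jstat, JMX, a monitoring agent).

```python
def heap_never_recovers(samples_mb, window=6, min_drop_mb=20):
    """Return True if, across the last `window` intervals, used heap never
    dropped by at least `min_drop_mb` from one sample to the next.

    A healthy heap shows a sawtooth (regular drops after each GC); the
    problem case shows growth with no meaningful drop at all.
    Thresholds here are illustrative assumptions, not tuned values.
    """
    recent = samples_mb[-(window + 1):]
    if len(recent) < window + 1:
        return False  # not enough data to judge
    drops = [prev - cur for prev, cur in zip(recent, recent[1:])]
    return all(d < min_drop_mb for d in drops)


# Illustrative numbers only, not real measurements.
normal = [480, 520, 470, 510, 460, 530, 480]   # sawtooth around 500MB
runaway = [500, 560, 630, 700, 780, 860, 940]  # climbing toward 1000MB
print(heap_never_recovers(normal), heap_never_recovers(runaway))
```

A check like this only says that the heap is not being reclaimed over
the window; it says nothing about why.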

I would speculate, given what we are seeing, that some trigger event
(lost SAML response from Google maybe) causes Google to think the login
succeeded, but the SSO server to think it failed, causing it to generate a
new ST, and another, and another, until something kicks in (some sort of
timer?) and the cycle terminates, but several instances of some variable
never get properly derefed and take a LONG time to get GCed.
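
For anyone wanting to spot the same thing in their own logs, the check
we are effectively doing by hand could be sketched like this. It
assumes you have already parsed your ticket-creation log entries into
(timestamp, user, service) tuples; that parsing step depends entirely on
your log format and is not shown. The function name and the default
threshold of 1000 per hour (taken from the volumes described in this
thread) are assumptions for illustration.

```python
from collections import Counter
from datetime import datetime


def flag_st_bursts(events, threshold=1000):
    """Count service tickets per (user, service, hour) and return the
    buckets whose count is at least `threshold`.

    `events` is an iterable of (timestamp, user, service) tuples, one
    per ST created, where timestamp is a datetime. Normal traffic is
    many different users with few repeats; the problem case is one user
    with a very large count against a single service.
    """
    counts = Counter()
    for ts, user, service in events:
        hour = ts.replace(minute=0, second=0, microsecond=0)
        counts[(user, service, hour)] += 1
    return {key: n for key, n in counts.items() if n >= threshold}
```

Lowering the threshold well below 1000 would probably give earlier
warning, since in a normal hour we see very few repeats per user.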


On 12/12/14 6:20 AM, David A. Kovacic wrote:
> Exactly right.  If we see 3000 STs created in an hour for a user, all
> to Google, we are also seeing Google report 3000 successful logins in
> the same time frame reported in the Google admin console audit logs. 
> As far as I can tell, whatever condition triggers this (and it may be
> some form of malware being used to send spam through us) gets
> credentials once (only one TGT is ever created) and then somehow does
> more than 1000 logins in about an hour to Google without ever logging back out
> of Google.  As far as we can tell, there are no errors in the process
> of logging in, but because the Google SAML process seems to leave
> fairly large remnants of instances in the heap, and those remnants are
> not being GCed in a timely fashion, we run out of heap memory and the
> SSO server process locks up, taking the other server with it.  To
> summarize, it seems not to be the SAML process itself, but the VOLUME
> of SAML processes in a very short time that seems to cause the issue.
>
>
> On 12/11/14 9:01 PM, Sean Baker wrote:
>> Now that's interesting -- is that to say that when you see these
>> rapidly-generated service tickets for particular users you're seeing
>> them logging in as many times to Google as well?
>>
>>
>>
>>
>> On 12/11/14, 14:17 PM, David A. Kovacic wrote:
>>> Google seems to be accepting the assertions each time as we are
>>> seeing the same number of logins in Google's audit logs as the
>>> number of STs being created.  I would expect that if there was
>>> something wrong with the assertion we would be receiving complaints from
>>> the users.  I am more inclined at this point to believe some sort of
>>> crazy browser loop, but it's definitely not happening with any
>>> consistency. 
>>>
>>> We have tried contacting the two people we identified once we
>>> started to get a handle on what the issue was, however neither has
>>> responded.  That's not terribly surprising given that we are in our
>>> finals period here and requests for information go pretty much
>>> ignored by students and faculty alike at this time.
>>>
>>> Dave
>>>
>>> On 12/10/14 8:14 PM, Sean Baker wrote:
>>>> Your access logs should show the individual SAMLRequests generated by 
>>>> Google; if it's rejecting your assertions in some automated way you 
>>>> should see a new SAMLRequest each time.  If it's the same request over 
>>>> and over, one might infer a more local issue (not definitively mind you; 
>>>> just much more likely) [ehcache issue, browser configuration, etc.].
>>>>
>>>> Has anyone talked with your end users who're triggering these events 
>>>> about what they experienced?
>>>>
>>>> On 12/10/14, 15:16 PM, David A. Kovacic wrote:
>>>>> Does anyone know what I would need to do to be able to log the
>>>>> actual SAML transactions?  Is there any way to actually do that? 
>>>>> We have isolated this issue to only logins to Google and only
>>>>> under certain conditions when something seems to start looping and
>>>>> generating STs rapidly.  We are trying to isolate the conditions
>>>>> under which the loop starts. 
>>>>>
>>>>> It would be helpful to actually see the SAML transactions being
>>>>> generated so we could begin to get a handle on what Google apps is
>>>>> being referenced and if Google is returning any errors or not
>>>>> (although Google claims valid logins).
>>>>>
>>>>>
>>>>> On 12/6/14 9:11 AM, Marvin Addison wrote:
>>>>>>
>>>>>>     Second, the massive number of  STs are being created on only
>>>>>>     one server (we can tell by the host name in the logged ST)
>>>>>>     but the OTHER SERVER is where the memory is growing out of
>>>>>>     bounds.
>>>>>>
>>>>>>
>>>>>> I'm still working through this thread, but I wanted to point out
>>>>>> that the other server is likely hurting because of load balancer session
>>>>>> affinity. Recall that ticket validation is a back-channel call,
>>>>>> and the network source differs from that of the user's browser.
>>>>>> In our environment, services typically get stuck on one node
>>>>>> causing hot spots. This is because the service is validating
>>>>>> tickets frequently enough that the session affinity timeout never
>>>>>> kicks in.
>>>>>>
>>>>>> M
>>>>>>
>>>>>> -- 
>>>>>> You are currently subscribed to [email protected] as: 
>>>>>> [email protected]
>>>>>> To unsubscribe, change settings or access archives, see 
>>>>>> http://www.ja-sig.org/wiki/display/JSG/cas-user
>>>>>>
