Hi Ryan,

Thanks for responding!

I’ve attached our ehcacheConfig. Comparing it to the default
configuration, the only differences are the overall number of elements
(10000 in ours vs. 1000 in the default) and the temp disk store location.

I’m assuming you are asking whether each user in our system has the exact
same set of gadgets to render, correct?  If that’s the case: different
users have different sets of gadgets, but many of them have a default set
we give them when they are initially set up in our system, so many people
will hit the same gadgets over and over again.  This default subset is
about 10-12 different gadgets, and that is by and large what many users
have.

However, we have a total of 48 different gadgets that could be rendered
by a user at any given time on this instance of shindig.  We do run
another instance of shindig that renders a different subset of gadgets,
but it has much lower usage and only renders about 10 different gadgets
altogether.


I am admittedly rusty on ehCache configuration, but here are a couple of
things I noticed (sketched out after this list):
* The maxBytesLocalHeap in the ehCacheConfig is 50mb, which seems low;
however, this is the same setting we had in shindig 2.0, so I have to
wonder if that has anything to do with it.
* Our old ehCache configuration for shindig 2.0 specified a defaultCache
maxElementsInMemory of 1000 but NO sizeOfPolicy at all.
* Our new ehCache configuration for shindig 2.5 specifies a sizeOfPolicy
maxDepth of 10000 but NO defaultCache maxElementsInMemory.
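
One thing I had to remind myself of while writing this up: sizeOfPolicy
maxDepth is not an element count at all. It caps how many object
references ehcache’s SizeOf engine will walk when computing byte sizes
for maxBytesLocalHeap, and with byte-based sizing that walk happens every
time an element is put. So the old and new configs are doing very
different things; roughly (cache definitions elided, values taken from
our files):

    <!-- shindig 2.0 era: count-based capacity, no byte-based sizing -->
    <defaultCache maxElementsInMemory="1000" eternal="false"/>

    <!-- shindig 2.5 era: byte-based capacity; sizeOfPolicy only tunes
         the SizeOf engine's traversal, not how many entries fit -->
    <ehcache maxBytesLocalHeap="50M">
      <sizeOfPolicy maxDepth="10000"/>
      ...
    </ehcache>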

Our heap sizes in Tomcat are 2048mb, which seems adequate given a 50mb
max heap for a cache. This is the same heap size from when we were using
shindig 2.0.  Unfortunately, we don’t have profiling tools enabled on our
Tomcat instances, so I can’t see what the heap looked like when things
crashed, and like I said, we’re unable to reproduce this in int.
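
For next time, even without a profiler, I think something like the
following in Tomcat’s bin/setenv.sh would at least capture GC behavior
and a heap dump when this happens (standard HotSpot flags; the log paths
here are made up):

    # hypothetical setenv.sh additions: log GC activity, dump heap on OOM
    CATALINA_OPTS="$CATALINA_OPTS -verbose:gc -XX:+PrintGCDetails \
      -XX:+PrintGCTimeStamps -Xloggc:/var/log/tomcat/gc.log \
      -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/tomcat"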

I think we might be on to something here… I will keep searching, but if
any devs out there have any ideas, please let me know.

Thanks shindig list!
-Matt

On 7/13/14, 10:12 AM, "Ryan Baxter" <rbaxte...@gmail.com> wrote:

>Matt can you tell us more about how you have configured the caches in
>shindig?  When you are rendering these gadgets are you rendering the same
>gadget across all users?
>
>-Ryan
>
>> On Jul 9, 2014, at 3:31 PM, "Merrill, Matt" <mmerr...@mitre.org> wrote:
>> 
>> Stanton, 
>> 
>> Thanks for responding!
>> 
>> This is one instance of shindig.
>> 
>> If you mean the configuration within the container and for the shindig
>> java app, then yes, the locked domains are the same.  In fact, the
>> configuration, with the exception of shindig's host URLs, is exactly
>> the same from what I can tell.
>> 
>> Unfortunately, I don't have any way to trace that exact message, but I
>> did do a traceroute from the server running shindig to the URL that is
>> being called for rpc calls to make sure there weren't any extra network
>> hops, and there weren't; it actually only had one, as expected for an
>> app making an HTTP call to itself.
>> 
>> Thanks again for responding.
>> 
>> -Matt
>> 
>>> On 7/9/14, 3:08 PM, "Stanton Sievers" <ssiev...@apache.org> wrote:
>>> 
>>> Hi Matt,
>>> 
>>> Is the configuration for locked domains and security tokens consistent
>>> between your test and production environments?
>>> 
>>> Do you have any way of tracing the request in the log entry you
>>>provided
>>> through the network?  Is this a single Shindig server or is there any
>>>load
>>> balancing occurring?
>>> 
>>> Regards,
>>> -Stanton
>>> 
>>> 
>>>> On Wed, Jul 9, 2014 at 2:40 PM, Merrill, Matt <mmerr...@mitre.org>
>>>>wrote:
>>>> 
>>>> Hi shindig devs,
>>>> 
>>>> We are in the process of upgrading from shindig 2.0 to 2.5-update1,
>>>> and everything has gone ok; however, once we got into our production
>>>> environment, we are seeing significant slowdowns for the opensocial
>>>> RPC calls that shindig makes to itself when rendering a gadget.
>>>> 
>>>> This is obviously very dependent on how we've implemented the shindig
>>>> interfaces in our own code, and also on our infrastructure, so we're
>>>> hoping someone on the list can help give us some more ideas for areas
>>>> to investigate inside shindig itself or in general.
>>>> 
>>>> Here's what's happening:
>>>> * Gadgets load fine when the app is not experiencing much load (< 10
>>>> users rendering 10-12 gadgets on a page)
>>>> * Once a reasonable subset of users begins rendering gadgets, gadget
>>>> render calls through the "ifr" endpoint start taking a very long time
>>>> to respond
>>>> * The problem gets worse from there
>>>> * Even with extensive load testing we can't recreate this problem in
>>>> our testing environments
>>>> * Our system administrators have assured us that the configurations
>>>> of our servers are the same between int and prod
>>>> 
>>>> This is an example of what we're seeing from the logs inside
>>>> BasicHttpFetcher:
>>>>
>>>> http://238redacteddnsprefix234.gadgetsv2.company.com:7001/gmodules/rpc?st=mycontainer%3AvY2rb-teGXuk9HX8d6W0rm6wE6hkLxM95ByaSMQlV8RudwohiAFqAliywVwc5yQ8maFSwK7IEhogNVnoUXa-doA3_h7EbSDGq_DW5i_VvC0CFEeaTKtr70A9XgYlAq5T95j7mivGO3lXVBTayU2PFNSdnLu8xtQEJJ7YrlmekEYyERmTSQmi7n2wZlmnG2puxVkegQKWNpdzOH4xCfgROnNCnAI
>>>> is responding slowly. 12,449 ms elapsed.
>>>> 
>>>> We'll continue to get these warnings for rpc calls for many different
>>>> gadgets, the amount of time elapsed will grow, and ultimately every
>>>> gadget render slows to a crawl.
>>>> 
>>>> Some other relevant information:
>>>> * We have implemented "throttling" logic in our own custom
>>>> HttpFetcher, which extends the BasicHttpFetcher.  Basically, it keeps
>>>> track of how many outgoing requests are in flight for a given url,
>>>> and if too many are going at once, it starts rejecting outgoing
>>>> requests.  This was done to avoid a situation where an external
>>>> service responding slowly ties up all of shindig's external http
>>>> connections.  In our case, I believe that because our rpc endpoint is
>>>> taking so long to respond, we start rejecting these requests with our
>>>> throttling logic.
>>>> 
>>>> I have tried to trace through the rpc calls inside the shindig code
>>>> starting in the RpcServlet, and as best I can tell, these rpc calls
>>>> are used for:
>>>> * getting viewer data
>>>> * getting application data
>>>> * anything else?
>>>> 
>>>> I've also looked at the BasicHttpFetcher, but nothing stands out to
>>>> me at first glance that would cause such a difference in performance
>>>> between environments if, as our sys admins say, they are the same.
>>>> 
>>>> Additionally, I've ensured that the database table which contains our
>>>> Application Data has been indexed properly (by person ID and gadget
>>>> url) and that person data is cached.
>>>> 
>>>> Any other ideas, or areas in the codebase to explore are very much
>>>> appreciated.
>>>> 
>>>> Thanks!
>>>> -Matt
>> 

Attachment: template-shindig-ehcacheConfig.xml