Hi shindig devs,

We are in the process of upgrading from Shindig 2.0 to 2.5-update1, and everything 
went fine until we reached our production environment. There we are seeing 
significant slowdowns in the OpenSocial RPC calls that Shindig makes to itself 
when rendering a gadget.

This obviously depends heavily on how we’ve implemented the Shindig interfaces 
in our own code, and on our infrastructure. Still, we’re hoping someone on the 
list can suggest areas to investigate, either inside Shindig itself or in general.

Here’s what’s happening:
* Gadgets load fine when the app is under light load (fewer than 10 users, each 
rendering 10-12 gadgets on a page)
* Once a reasonable subset of users begins rendering gadgets, gadget render 
calls through the “ifr” endpoint start taking a very long time to respond
* The problem gets worse from there
* Even with extensive load testing we can’t recreate this problem in our 
testing environments
* Our system administrators have assured us that the server configurations are 
identical between int and prod

This is an example of what we’re seeing from the logs inside BasicHttpFetcher:
http://238redacteddnsprefix234.gadgetsv2.company.com:7001/gmodules/rpc?st=mycontainer%3AvY2rb-teGXuk9HX8d6W0rm6wE6hkLxM95ByaSMQlV8RudwohiAFqAliywVwc5yQ8maFSwK7IEhogNVnoUXa-doA3_h7EbSDGq_DW5i_VvC0CFEeaTKtr70A9XgYlAq5T95j7mivGO3lXVBTayU2PFNSdnLu8xtQEJJ7YrlmekEYyERmTSQmi7n2wZlmnG2puxVkegQKWNpdzOH4xCfgROnNCnAI
 is responding slowly. 12,449 ms elapsed.

We keep getting these warnings for RPC calls for many different gadgets, the 
elapsed times keep growing, and ultimately every gadget render slows to a crawl.

Some other relevant information:
* We have implemented “throttling” logic in our own custom HttpFetcher, which 
extends BasicHttpFetcher. It keeps track of how many outgoing requests are in 
flight for a given URL, and once too many are running concurrently it starts 
rejecting new outgoing requests. This was done to avoid a situation where one 
slow external service ties up all of Shindig’s outgoing HTTP connections. In our 
case, I believe that because our RPC endpoint is taking so long to respond, our 
throttling logic starts rejecting these requests. A rough sketch of the pattern 
follows below.
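
For reference, here is roughly the shape of that throttle. I’ve written it as a 
decorator around an HttpFetcher rather than pasting our real subclass; the class 
name, the per-host cap, and rejecting via GadgetException are illustrative, not 
our production values:

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;
    import java.util.concurrent.Semaphore;

    import org.apache.shindig.gadgets.GadgetException;
    import org.apache.shindig.gadgets.http.HttpFetcher;
    import org.apache.shindig.gadgets.http.HttpRequest;
    import org.apache.shindig.gadgets.http.HttpResponse;

    // Caps concurrent outgoing requests per host; rejects once the cap is hit.
    public class ThrottlingHttpFetcher implements HttpFetcher {
      private static final int MAX_PER_HOST = 20;  // illustrative cap

      private final HttpFetcher delegate;
      private final ConcurrentMap<String, Semaphore> permits =
          new ConcurrentHashMap<String, Semaphore>();

      public ThrottlingHttpFetcher(HttpFetcher delegate) {
        this.delegate = delegate;
      }

      public HttpResponse fetch(HttpRequest request) throws GadgetException {
        String host = request.getUri().getAuthority();
        Semaphore sem = permits.get(host);
        if (sem == null) {
          Semaphore fresh = new Semaphore(MAX_PER_HOST);
          Semaphore prev = permits.putIfAbsent(host, fresh);
          sem = (prev == null) ? fresh : prev;
        }
        // Fail fast instead of queueing behind a slow host.
        if (!sem.tryAcquire()) {
          throw new GadgetException(GadgetException.Code.INTERNAL_SERVER_ERROR,
              "Throttled: too many concurrent requests to " + host);
        }
        try {
          return delegate.fetch(request);
        } finally {
          sem.release();
        }
      }
    }

The point is just that once the cap is reached we fail fast rather than letting 
requests queue up behind a slow host.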

I have tried to trace the RPC calls through the Shindig code starting in 
RpcServlet, and as best I can tell these calls are used for the following 
(an example payload follows the list):
* getting viewer data
* getting application data
* anything else?
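
For anyone tracing along: the self-call is an OpenSocial JSON-RPC batch POSTed 
to the rpc endpoint, so based on the spec I’d expect the render to batch 
something like this (the method names come from the OpenSocial spec; the ids 
and exact params are illustrative guesses, not captured traffic):

    [
      {"method": "people.get", "id": "viewer",
       "params": {"userId": ["@viewer"], "groupId": "@self"}},
      {"method": "appdata.get", "id": "appdata",
       "params": {"userId": ["@viewer"], "groupId": "@self",
                  "appId": "@app", "keys": ["*"]}}
    ]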

I’ve also looked at BasicHttpFetcher, but nothing stands out to me at first 
glance that would cause such a performance difference between environments if, 
as our sysadmins say, they are configured identically.

Additionally, I’ve ensured that the database table containing our Application 
Data is indexed properly (by person ID and gadget URL, roughly as below) and 
that person data is cached.
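
For concreteness, the composite index looks roughly like this; the table and 
column names here are placeholders, not our real schema:

    CREATE INDEX appdata_by_person_gadget
        ON application_data (person_id, gadget_url);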

Any other ideas or areas in the codebase to explore would be very much appreciated.

Thanks!
-Matt
