Hi Shindig devs,

We are in the process of upgrading from Shindig 2.0 to 2.5-update1, and everything had gone well until we reached our production environment, where we are seeing significant slowdowns in the OpenSocial RPC calls that Shindig makes to itself when rendering a gadget.
This is obviously very dependent on how we've implemented the Shindig interfaces in our own code, and on our infrastructure, but we're hoping someone on the list can suggest further areas to investigate, inside Shindig itself or in general. Here's what's happening:

* Gadgets load fine when the app is not under much load (< 10 users rendering 10-12 gadgets on a page).
* Once a reasonable number of users begins rendering gadgets, gadget render calls through the "ifr" endpoint start taking a very long time to respond.
* The problem gets worse from there.
* Even with extensive load testing, we can't recreate this problem in our testing environments.
* Our system administrators have assured us that the configurations of our servers are the same between int and prod.

This is an example of what we're seeing in the logs from BasicHttpFetcher:

http://238redacteddnsprefix234.gadgetsv2.company.com:7001/gmodules/rpc?st=mycontainer%3AvY2rb-teGXuk9HX8d6W0rm6wE6hkLxM95ByaSMQlV8RudwohiAFqAliywVwc5yQ8maFSwK7IEhogNVnoUXa-doA3_h7EbSDGq_DW5i_VvC0CFEeaTKtr70A9XgYlAq5T95j7mivGO3lXVBTayU2PFNSdnLu8xtQEJJ7YrlmekEYyERmTSQmi7n2wZlmnG2puxVkegQKWNpdzOH4xCfgROnNCnAI is responding slowly. 12,449 ms elapsed.

We keep getting these warnings for RPC calls for many different gadgets, the elapsed times grow, and ultimately every gadget render slows to a crawl.

Some other relevant information:

* We have implemented "throttling" logic in our own custom HttpFetcher, which extends BasicHttpFetcher. Basically, it keeps track of how many outgoing requests are in flight for a given URL, and if too many are running concurrently, it starts rejecting new outgoing requests. This was done to avoid a situation where a slowly responding external service ties up all of Shindig's outgoing HTTP connections.
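For reference, a minimal sketch of the kind of throttling we do, with all names and the concurrency cap being illustrative rather than our actual code or Shindig API:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical per-URL throttling gate: track in-flight requests per URL
// and reject new ones once a cap is exceeded. The fetcher would call
// tryAcquire() before fetching and release() in a finally block.
public class ThrottlingGate {
    // Assumed cap on concurrent requests to one URL (illustrative value).
    private static final int MAX_CONCURRENT = 20;

    private final Map<String, AtomicInteger> inFlight = new ConcurrentHashMap<>();

    // Returns false when the URL already has too many requests in flight,
    // i.e. the outgoing request should be rejected.
    public boolean tryAcquire(String url) {
        AtomicInteger count = inFlight.computeIfAbsent(url, k -> new AtomicInteger());
        if (count.incrementAndGet() > MAX_CONCURRENT) {
            count.decrementAndGet(); // undo the optimistic increment
            return false;
        }
        return true;
    }

    // Must be called once per successful tryAcquire(), whether the fetch
    // succeeded or failed, so slots are not leaked.
    public void release(String url) {
        AtomicInteger count = inFlight.get(url);
        if (count != null) {
            count.decrementAndGet();
        }
    }
}
```

The relevant point is that a slow endpoint keeps its slots occupied longer, so under load the gate starts rejecting requests to that URL.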
In our case, I believe that because our RPC endpoint is taking so long to respond, we start rejecting these requests with our throttling logic.

I have tried to trace the RPC calls through the Shindig code, starting in RpcServlet, and as best I can tell these calls are used for:

* getting viewer data
* getting application data
* anything else?

I've also looked at BasicHttpFetcher, but nothing stands out at first glance that would cause such a difference in performance between environments if, as our sysadmins say, they are the same. Additionally, I've ensured that the database table containing our application data is indexed properly (by person ID and gadget URL) and that person data is cached.

Any other ideas, or areas in the codebase to explore, are very much appreciated. Thanks!

-Matt