Hi Matt,

Is the configuration for locked domains and security tokens consistent
between your test and production environments?

Do you have any way of tracing the request from the log entry you provided
as it travels through the network? Is this a single Shindig server, or is
there load balancing involved?

Regards,
-Stanton


On Wed, Jul 9, 2014 at 2:40 PM, Merrill, Matt <mmerr...@mitre.org> wrote:

> Hi shindig devs,
>
> We are in the process of upgrading from Shindig 2.0 to 2.5-update1, and
> everything has gone OK. However, once we got into our production
> environment, we started seeing significant slowdowns in the OpenSocial RPC
> calls that Shindig makes to itself when rendering a gadget.
>
> This obviously depends heavily on how we’ve implemented the Shindig
> interfaces in our own code, and on our infrastructure, so we’re hoping
> someone on the list can give us more ideas for areas to investigate,
> either inside Shindig itself or in general.
>
> Here’s what’s happening:
> * Gadgets load fine when the app is not experiencing much load (< 10 users
> rendering 10-12 gadgets on a page)
> * Once a reasonable subset of users begins rendering gadgets, gadget
> render calls through the “ifr” endpoint start taking a very long time to
> respond
> * The problem gets worse from there
> * Even with extensive load testing we can’t recreate this problem in our
> testing environments
> * Our system administrators have assured us that the configurations of our
> servers are the same between int and prod
>
> This is an example of what we’re seeing from the logs inside
> BasicHttpFetcher:
>
> http://238redacteddnsprefix234.gadgetsv2.company.com:7001/gmodules/rpc?st=mycontainer%3AvY2rb-teGXuk9HX8d6W0rm6wE6hkLxM95ByaSMQlV8RudwohiAFqAliywVwc5yQ8maFSwK7IEhogNVnoUXa-doA3_h7EbSDGq_DW5i_VvC0CFEeaTKtr70A9XgYlAq5T95j7mivGO3lXVBTayU2PFNSdnLu8xtQEJJ7YrlmekEYyERmTSQmi7n2wZlmnG2puxVkegQKWNpdzOH4xCfgROnNCnAI
> is responding slowly. 12,449 ms elapsed.
>
> We keep getting these warnings for RPC calls for many different gadgets,
> the elapsed time grows, and ultimately every gadget render slows to a
> crawl.
>
> Some other relevant information:
> * We have implemented “throttling” logic in our own custom HttpFetcher,
> which extends BasicHttpFetcher. It keeps track of how many outgoing
> requests are in flight for a given URL, and if too many are running
> concurrently, it starts rejecting new outgoing requests. We did this to
> avoid a situation where a slowly responding external service ties up all
> of Shindig’s outbound HTTP connections. In our case, I believe that
> because our RPC endpoint is taking so long to respond, our throttling
> logic starts rejecting these requests. A simplified sketch of the
> counting logic is below.
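>
> Conceptually, the counting works like the following. This is an
> illustrative sketch rather than our exact code (the class name, method
> names, and limit are all made up):
>
> import java.util.concurrent.ConcurrentHashMap;
> import java.util.concurrent.ConcurrentMap;
> import java.util.concurrent.atomic.AtomicInteger;
>
> public class ThrottlingState {
>   private static final int MAX_CONCURRENT_PER_HOST = 20; // made-up limit
>
>   // In-flight request count per host.
>   private final ConcurrentMap<String, AtomicInteger> inFlight =
>       new ConcurrentHashMap<String, AtomicInteger>();
>
>   /** Returns false if the request should be rejected. */
>   boolean tryAcquire(String host) {
>     AtomicInteger counter = inFlight.get(host);
>     if (counter == null) {
>       AtomicInteger fresh = new AtomicInteger();
>       counter = inFlight.putIfAbsent(host, fresh);
>       if (counter == null) {
>         counter = fresh;
>       }
>     }
>     if (counter.incrementAndGet() > MAX_CONCURRENT_PER_HOST) {
>       counter.decrementAndGet(); // over the limit; undo and reject
>       return false;
>     }
>     return true;
>   }
>
>   void release(String host) {
>     AtomicInteger counter = inFlight.get(host);
>     if (counter != null) {
>       counter.decrementAndGet();
>     }
>   }
> }
>
> In this sketch, a fetcher would call tryAcquire() before delegating to
> the parent fetch() and release() in a finally block.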
>
> I have tried to trace through the RPC calls inside the Shindig code,
> starting in the RpcServlet, and as best I can tell these RPC calls are
> used for (see the timing sketch after this list):
> * getting viewer data
> * getting application data
> * anything else?
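>
> To separate server-side handling time in the RPC servlet from the
> client-side timing that BasicHttpFetcher logs, a generic servlet filter
> mapped to the RPC endpoint could help. This is a rough sketch, not
> Shindig code; the threshold and log message are placeholders:
>
> import java.io.IOException;
> import javax.servlet.Filter;
> import javax.servlet.FilterChain;
> import javax.servlet.FilterConfig;
> import javax.servlet.ServletException;
> import javax.servlet.ServletRequest;
> import javax.servlet.ServletResponse;
>
> public class RpcTimingFilter implements Filter {
>   public void init(FilterConfig config) {}
>   public void destroy() {}
>
>   public void doFilter(ServletRequest req, ServletResponse resp,
>       FilterChain chain) throws IOException, ServletException {
>     long start = System.nanoTime();
>     try {
>       chain.doFilter(req, resp); // let the RPC servlet do its work
>     } finally {
>       long elapsedMs = (System.nanoTime() - start) / 1000000L;
>       if (elapsedMs > 1000) { // arbitrary threshold
>         System.err.println("RPC handled in " + elapsedMs + " ms");
>       }
>     }
>   }
> }
>
> If the server-side numbers stay small while BasicHttpFetcher reports
> 12-second responses, that would suggest the time is being lost in
> connection setup or queuing rather than in the RPC handling itself.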
>
> I’ve also looked at BasicHttpFetcher, but nothing there stands out at
> first glance that would cause such a difference in performance between
> environments if, as our sys admins say, they are the same.
>
> Additionally, I’ve ensured that the database table containing our
> Application Data is indexed properly (by person ID and gadget URL) and
> that person data is cached (rough sketch of the caching approach below).
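>
> The person-data caching is conceptually like the following. The names
> (PersonCache, Person, loadPersonFromDb) are illustrative, not our real
> types, and this assumes Guava’s CacheBuilder is available on the
> classpath:
>
> import java.util.concurrent.TimeUnit;
> import com.google.common.cache.CacheBuilder;
> import com.google.common.cache.CacheLoader;
> import com.google.common.cache.LoadingCache;
>
> public class PersonCache {
>   static class Person {} // placeholder for our actual person type
>
>   private final LoadingCache<String, Person> cache =
>       CacheBuilder.newBuilder()
>           .maximumSize(10000)                    // bound memory use
>           .expireAfterWrite(5, TimeUnit.MINUTES) // tolerate stale data
>           .build(new CacheLoader<String, Person>() {
>             @Override
>             public Person load(String personId) {
>               return loadPersonFromDb(personId);
>             }
>           });
>
>   Person get(String personId) {
>     return cache.getUnchecked(personId);
>   }
>
>   private Person loadPersonFromDb(String personId) {
>     // ... indexed database query by person ID ...
>     return new Person(); // placeholder
>   }
> }
>
> Bounding the size and adding an expiry keeps the cache from growing
> without limit or serving stale data indefinitely.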
>
> Any other ideas, or areas in the codebase to explore are very much
> appreciated.
>
> Thanks!
> -Matt
>
