Matt, I think further investigation is warranted.  You really need to
find a way to trace through the code and pinpoint where the slowdown is
occurring; that will help us narrow down what the problem is.  I know it
is production, but getting some code on there that times method calls
and such can be very useful.
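
For example, something as simple as the wrapper below, dropped around the
suspect calls (cache lookups, the outgoing rpc fetch in your HttpFetcher,
the AppData queries), would show where the time is going.  This is only a
sketch with made-up names, not Shindig API:

    // Hypothetical helper to illustrate the idea: wrap a suspect call and
    // log anything slower than a threshold.
    import java.util.concurrent.Callable;
    import java.util.logging.Logger;

    public final class CallTimer {
        private static final Logger LOG = Logger.getLogger("CallTimer");
        private static final long SLOW_THRESHOLD_MS = 500; // tune for prod noise

        public static <T> T time(String label, Callable<T> work) throws Exception {
            long start = System.nanoTime();
            try {
                return work.call();
            } finally {
                long elapsedMs = (System.nanoTime() - start) / 1000000L;
                if (elapsedMs > SLOW_THRESHOLD_MS) {
                    LOG.warning(label + " took " + elapsedMs + " ms");
                }
            }
        }
    }

Wrapping the rpc fetch, the appdata lookup, and the cache reads with
separate labels like that should make it obvious in the production logs
which layer those 12-second responses are coming from.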

On Tue, Jul 15, 2014 at 3:04 PM, Merrill, Matt <mmerr...@mitre.org> wrote:
> Hi Ryan,
>
> Thanks for responding!
>
> I've attached our ehcacheConfig; comparing it to the default configuration,
> the only differences are the overall number of elements (10000 in ours vs
> 1000 in the default) and the temp disk store location.
>
> I'm assuming you are asking whether each user in our system has the exact
> same set of gadgets to render, correct?  If that's the case: different
> users have different sets of gadgets, but many of them have a default set
> we give them when they are initially set up in our system.  So many people
> will hit the same gadgets over and over again.  This default subset is
> about 10-12 different gadgets, and that is by and large what many users
> have.
>
> However, we have a total of 48 different gadgets that could be rendered by
> a user at any given time on this instance of shindig.  We do run another
> instance of shindig which could render a different subset of gadgets, but
> that has a much lower usage and only renders about 10 different gadgets
> altogether.
>
>
> I am admittedly rusty with my ehCache configuration knowledge, but here
> are a couple of things I noticed:
> * The maxBytesLocalHeap in the ehCacheConfig is 50mb, which seems low.
> However, this is the same setting we had in shindig 2.0, so I have to
> wonder whether it has anything to do with this (a quick way to check is
> sketched below).
> * Our old ehCache configuration for shindig 2.0 specified a defaultCache
> maxElementsInMemory of 1000 but NO sizeOfPolicy at all.
> * Our new ehCache configuration for shindig 2.5 specifies a sizeOfPolicy
> maxDepth of 10000 but NO defaultCache maxElementsInMemory.
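> 
> To see whether those limits actually matter, one thing we could do is drop
> in a temporary snippet like the one below and log how full the shindig
> caches get under production load.  This is only a rough sketch against the
> ehcache 2.x API (none of it is Shindig code), and the exact statistics
> calls vary by ehcache version:
> 
>     // Rough diagnostic sketch: dump the size of every ehcache cache so we
>     // can see whether the 50mb / 10000-element limits are being hit when
>     // things slow down.
>     import java.util.logging.Logger;
>     import net.sf.ehcache.Cache;
>     import net.sf.ehcache.CacheManager;
> 
>     public final class CacheSizeDumper {
>         private static final Logger LOG =
>             Logger.getLogger("CacheSizeDumper");
> 
>         public static void dump() {
>             for (CacheManager manager : CacheManager.ALL_CACHE_MANAGERS) {
>                 for (String name : manager.getCacheNames()) {
>                     Cache cache = manager.getCache(name);
>                     if (cache != null) {
>                         LOG.info("cache=" + name + " size=" + cache.getSize());
>                     }
>                 }
>             }
>         }
>     }
> 
> Calling that from a background thread every minute or so would tell us
> whether the caches are anywhere near their limits when the slowdown starts.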
>
> Our heap sizes in tomcat are 2048mb, which seems adequate given a 50mb cap
> on the cache heap.  This is the same heap size from when we were using
> shindig 2.0.  Unfortunately, we don't have profiling tools enabled on our
> Tomcat instances, so I can't see what the heap looked like when things
> crashed, and like I said, we're unable to reproduce this in int.
>
> I think we might be on to something here… I will keep searching but if any
> devs out there have any ideas, please let me know.
>
> Thanks shindig list!
> -Matt
>
> On 7/13/14, 10:12 AM, "Ryan Baxter" <rbaxte...@gmail.com> wrote:
>
>>Matt, can you tell us more about how you have configured the caches in
>>shindig?  When you are rendering these gadgets, are you rendering the same
>>gadget across all users?
>>
>>-Ryan
>>
>>> On Jul 9, 2014, at 3:31 PM, "Merrill, Matt" <mmerr...@mitre.org> wrote:
>>>
>>> Stanton,
>>>
>>> Thanks for responding!
>>>
>>> This is one instance of shindig.
>>>
>>> If you mean the configuration within the container and for the shindig
>>> java app, then yes, the locked domains are the same.  In fact, from what
>>> I can tell the configuration is exactly the same, with the exception of
>>> shindig's host URLs.
>>>
>>> Unfortunately, I don't have any way to trace that exact message, but I
>>> did do a traceroute from the server running shindig to the URL that is
>>> being called for rpc calls, to make sure there weren't any extra network
>>> hops.  There weren't; it actually had only one hop, as expected for an
>>> app making an HTTP call to itself.
>>>
>>> Thanks again for responding.
>>>
>>> -Matt
>>>
>>>> On 7/9/14, 3:08 PM, "Stanton Sievers" <ssiev...@apache.org> wrote:
>>>>
>>>> Hi Matt,
>>>>
>>>> Is the configuration for locked domains and security tokens consistent
>>>> between your test and production environments?
>>>>
>>>> Do you have any way of tracing the request in the log entry you
>>>> provided through the network?  Is this a single Shindig server or is
>>>> there any load balancing occurring?
>>>>
>>>> Regards,
>>>> -Stanton
>>>>
>>>>
>>>>> On Wed, Jul 9, 2014 at 2:40 PM, Merrill, Matt <mmerr...@mitre.org> wrote:
>>>>>
>>>>> Hi shindig devs,
>>>>>
>>>>> We are in the process of upgrading from shindig 2.0 to 2.5-update1,
>>>>> and everything had gone OK.  However, once we got into our production
>>>>> environment, we started seeing significant slowdowns in the opensocial
>>>>> RPC calls that shindig makes to itself when rendering a gadget.
>>>>>
>>>>> This is obviously very dependent on how we've implemented the shindig
>>>>> interfaces in our own code, and also on our infrastructure, but we're
>>>>> hoping someone on the list can give us some more ideas for areas to
>>>>> investigate, inside shindig itself or in general.
>>>>>
>>>>> Here's what's happening:
>>>>> * Gadgets load fine when the app is not experiencing much load (< 10
>>>>> users rendering 10-12 gadgets on a page)
>>>>> * Once a reasonable number of users begins rendering gadgets, gadget
>>>>> render calls through the "ifr" endpoint start taking a very long time
>>>>> to respond
>>>>> * The problem gets worse from there
>>>>> * Even with extensive load testing we can't recreate this problem in
>>>>> our testing environments
>>>>> * Our system administrators have assured us that the configurations of
>>>>> our servers are the same between int and prod
>>>>>
>>>>> This is an example of what we're seeing from the logs inside
>>>>> BasicHttpFetcher:
>>>>>
>>>>> http://238redacteddnsprefix234.gadgetsv2.company.com:7001/gmodules/rpc?st=mycontainer%3AvY2rb-teGXuk9HX8d6W0rm6wE6hkLxM95ByaSMQlV8RudwohiAFqAliywVwc5yQ8maFSwK7IEhogNVnoUXa-doA3_h7EbSDGq_DW5i_VvC0CFEeaTKtr70A9XgYlAq5T95j7mivGO3lXVBTayU2PFNSdnLu8xtQEJJ7YrlmekEYyERmTSQmi7n2wZlmnG2puxVkegQKWNpdzOH4xCfgROnNCnAI
>>>>> is responding slowly. 12,449 ms elapsed.
>>>>>
>>>>> We'll continue to get these warnings for rpc calls for many different
>>>>> gadgets, the amount of time elapsed will grow, and ultimately every
>>>>> gadget render slows to a crawl.
>>>>>
>>>>> Some other relevant information:
>>>>> * We have implemented "throttling" logic in our own custom HttpFetcher,
>>>>> which extends BasicHttpFetcher.  Basically, it keeps track of how many
>>>>> outgoing requests are in flight for a given url, and if too many are
>>>>> running concurrently, it starts rejecting new outgoing requests (a
>>>>> rough sketch of the idea is below).  This was done to avoid a
>>>>> situation where an external service that is responding slowly ties up
>>>>> all of shindig's external http connections.  In our case, I believe
>>>>> that because our rpc endpoint is taking so long to respond, we start
>>>>> rejecting these requests with our throttling logic.
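>>>>>
>>>>> The gist of the throttling is roughly the following (heavily
>>>>> simplified, with made-up names; not our actual class):
>>>>>
>>>>>     // Simplified illustration of the throttling idea: cap the number
>>>>>     // of concurrent outgoing requests per url and reject anything
>>>>>     // over the limit, so one slow endpoint cannot tie up every
>>>>>     // outbound connection.
>>>>>     import java.util.concurrent.ConcurrentHashMap;
>>>>>     import java.util.concurrent.ConcurrentMap;
>>>>>     import java.util.concurrent.Semaphore;
>>>>>
>>>>>     public final class PerUrlThrottle {
>>>>>         private static final int MAX_CONCURRENT_PER_URL = 20;
>>>>>         private final ConcurrentMap<String, Semaphore> permits =
>>>>>             new ConcurrentHashMap<String, Semaphore>();
>>>>>
>>>>>         private Semaphore permitsFor(String url) {
>>>>>             Semaphore s = permits.get(url);
>>>>>             if (s == null) {
>>>>>                 s = new Semaphore(MAX_CONCURRENT_PER_URL);
>>>>>                 Semaphore existing = permits.putIfAbsent(url, s);
>>>>>                 if (existing != null) {
>>>>>                     s = existing;
>>>>>                 }
>>>>>             }
>>>>>             return s;
>>>>>         }
>>>>>
>>>>>         /** Returns false if the request should be rejected. */
>>>>>         public boolean tryAcquire(String url) {
>>>>>             return permitsFor(url).tryAcquire();
>>>>>         }
>>>>>
>>>>>         public void release(String url) {
>>>>>             permitsFor(url).release();
>>>>>         }
>>>>>     }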
>>>>>
>>>>> I have tried to trace through the rpc calls inside the shindig code
>>>>> starting in the RpcServlet, and as best I can tell, these rpc calls
>>>>> are used for:
>>>>> * getting viewer data
>>>>> * getting application data
>>>>> * anything else?
>>>>>
>>>>> I've also looked at the BasicHttpFetcher, but nothing stands out to me
>>>>> at first glance that would cause such a difference in performance
>>>>> between environments if, as our sys admins say, they are the same.
>>>>>
>>>>> Additionally, I've ensured that the database table which contains our
>>>>> Application Data has been indexed properly (by person ID and gadget
>>>>> url) and that person data is cached.
>>>>>
>>>>> Any other ideas, or areas in the codebase to explore are very much
>>>>> appreciated.
>>>>>
>>>>> Thanks!
>>>>> -Matt
>>>
>
