Yep, that's where I'm headed next. Understandably, there's some hesitation on our product owners' part to do that, so it takes a while to get to that point.
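For what it's worth, all I have in mind is a shim like the one below around
the suspect fetch calls (just a sketch; the helper name and the one-second
threshold are made up, not anything already in our code):

import java.util.logging.Logger;

import org.apache.shindig.gadgets.GadgetException;
import org.apache.shindig.gadgets.http.HttpFetcher;
import org.apache.shindig.gadgets.http.HttpRequest;
import org.apache.shindig.gadgets.http.HttpResponse;

/** Hypothetical timing shim: logs any fetch that takes over a second. */
final class TimedFetch {
  private static final Logger LOG =
      Logger.getLogger(TimedFetch.class.getName());

  static HttpResponse fetchWithTiming(HttpFetcher fetcher, HttpRequest request)
      throws GadgetException {
    long start = System.nanoTime();
    try {
      return fetcher.fetch(request);
    } finally {
      long elapsedMs = (System.nanoTime() - start) / 1000000L;
      if (elapsedMs > 1000) { // only bother logging the slow calls
        LOG.warning("fetch of " + request.getUri() + " took " + elapsedMs + "ms");
      }
    }
  }
}

The same pattern could go around our AppData lookups, so we can see which
leg of the rpc call is actually slow.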
Will let you know what I find. Thanks!
-Matt

On 7/18/14, 8:44 AM, "Ryan Baxter" <rbaxte...@gmail.com> wrote:

>Matt, I think further investigation is warranted. I really think you
>need to find a way to trace through the code and find where the
>slowdown is occurring. That will help us narrow down what the problem
>is. I know it is production, but getting some code on there that
>starts timing method calls and such can be very useful.
>
>On Tue, Jul 15, 2014 at 3:04 PM, Merrill, Matt <mmerr...@mitre.org> wrote:
>> Hi Ryan,
>>
>> Thanks for responding!
>>
>> I've attached our ehcacheConfig; comparing it to the default
>> configuration, the only differences are the overall number of elements
>> (10000 in ours vs. 1000 in the default) and the temp disk store
>> location.
>>
>> I assume you're asking whether each user in our system has the exact
>> same set of gadgets to render, correct? If that's the case: different
>> users have different sets of gadgets; however, many of them have a
>> default set we give them when they are initially set up in our system.
>> So, many people will hit the same gadgets over and over again. This
>> default subset is about 10-12 different gadgets, and that is by and
>> large what many users have.
>>
>> However, we have a total of 48 different gadgets that could be
>> rendered by a user at any given time on this instance of shindig. We
>> do run another instance of shindig which could render a different
>> subset of gadgets, but that has much lower usage and only renders
>> about 10 different gadgets altogether.
>>
>> I am admittedly rusty with my ehCache configuration knowledge, but
>> here are a couple of things I noticed (a rough fragment of the config
>> follows the list):
>> * I notice that the maxBytesLocalHeap in the ehCacheConfig is 50mb,
>> which seems low; however, this is the same setting we had in shindig
>> 2.0, so I have to wonder if that has anything to do with it.
>> * Our old ehCache configuration for shindig 2.0 specified a
>> defaultCache maxElementsInMemory of 1000 but NO sizeOfPolicy at all.
>> * Our new ehCache configuration for shindig 2.5 specifies a
>> sizeOfPolicy maxDepth of 10000 but NO defaultCache
>> maxElementsInMemory.
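>>
>> For reference, the relevant fragment of the 2.5 file looks roughly
>> like this (heavily trimmed, and partly from memory, so treat it as a
>> sketch rather than a paste of the real config):
>>
>> <ehcache maxBytesLocalHeap="50M" ...>
>>   <sizeOfPolicy maxDepth="10000"/>
>>   <defaultCache ... /> <!-- no maxElementsInMemory anymore -->
>>   ...
>> </ehcache>
>>
>> If I understand the ehCache docs right, byte-based sizing means
>> ehCache has to walk each cached object graph to compute its size, and
>> sizeOfPolicy's maxDepth caps how many objects a single walk will
>> visit, so a large maxDepth isn't free the way a plain element count
>> is.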
>>
>> Our heap size in Tomcat is 2048mb, which seems adequate given a 50mb
>> max heap for a cache. This is the same heap size from when we were
>> using shindig 2.0. Unfortunately, we don't have profiling tools
>> enabled on our Tomcat instances, so I can't see what the heap looked
>> like when things crashed, and like I said, we're unable to reproduce
>> this in int.
>>
>> I think we might be on to something here… I will keep searching, but
>> if any devs out there have any ideas, please let me know.
>>
>> Thanks shindig list!
>> -Matt
>>
>> On 7/13/14, 10:12 AM, "Ryan Baxter" <rbaxte...@gmail.com> wrote:
>>
>>>Matt, can you tell us more about how you have configured the caches
>>>in shindig? When you are rendering these gadgets, are you rendering
>>>the same gadget across all users?
>>>
>>>-Ryan
>>>
>>>> On Jul 9, 2014, at 3:31 PM, "Merrill, Matt" <mmerr...@mitre.org>
>>>> wrote:
>>>>
>>>> Stanton,
>>>>
>>>> Thanks for responding!
>>>>
>>>> This is one instance of shindig.
>>>>
>>>> If you mean the configuration within the container and for the
>>>> shindig java app, then yes, the locked domains are the same. In
>>>> fact, the configuration, with the exception of shindig's host URLs,
>>>> is exactly the same from what I can tell.
>>>>
>>>> Unfortunately, I don't have any way to trace that exact message,
>>>> but I did do a traceroute from the server running shindig to the
>>>> URL that is being called for rpc calls, to make sure there weren't
>>>> any extra network hops. There weren't; it actually had only one, as
>>>> expected for an app making an HTTP call to itself.
>>>>
>>>> Thanks again for responding.
>>>>
>>>> -Matt
>>>>
>>>>> On 7/9/14, 3:08 PM, "Stanton Sievers" <ssiev...@apache.org> wrote:
>>>>>
>>>>> Hi Matt,
>>>>>
>>>>> Is the configuration for locked domains and security tokens
>>>>> consistent between your test and production environments?
>>>>>
>>>>> Do you have any way of tracing the request in the log entry you
>>>>> provided through the network? Is this a single Shindig server, or
>>>>> is there any load balancing occurring?
>>>>>
>>>>> Regards,
>>>>> -Stanton
>>>>>
>>>>>
>>>>>> On Wed, Jul 9, 2014 at 2:40 PM, Merrill, Matt
>>>>>> <mmerr...@mitre.org> wrote:
>>>>>>
>>>>>> Hi shindig devs,
>>>>>>
>>>>>> We are in the process of upgrading from shindig 2.0 to
>>>>>> 2.5-update1, and everything has gone ok; however, once we got
>>>>>> into our production environment, we are seeing significant
>>>>>> slowdowns in the opensocial RPC calls that shindig makes to
>>>>>> itself when rendering a gadget.
>>>>>>
>>>>>> This is obviously very dependent on how we've implemented the
>>>>>> shindig interfaces in our own code, and also on our
>>>>>> infrastructure, but we're hoping someone on the list can give us
>>>>>> some more ideas for areas to investigate, inside shindig itself
>>>>>> or in general.
>>>>>>
>>>>>> Here's what's happening:
>>>>>> * Gadgets load fine when the app is not experiencing much load
>>>>>> (< 10 users rendering 10-12 gadgets on a page)
>>>>>> * Once a reasonable subset of users begins rendering gadgets,
>>>>>> gadget render calls through the "ifr" endpoint start taking a
>>>>>> very long time to respond
>>>>>> * The problem gets worse from there
>>>>>> * Even with extensive load testing, we can't recreate this
>>>>>> problem in our testing environments
>>>>>> * Our system administrators have assured us that the
>>>>>> configurations of our servers are the same between int and prod
>>>>>>
>>>>>> This is an example of what we're seeing from the logs inside
>>>>>> BasicHttpFetcher:
>>>>>>
>>>>>> http://238redacteddnsprefix234.gadgetsv2.company.com:7001/gmodules/rpc?st=mycontainer%3AvY2rb-teGXuk9HX8d6W0rm6wE6hkLxM95ByaSMQlV8RudwohiAFqAliywVwc5yQ8maFSwK7IEhogNVnoUXa-doA3_h7EbSDGq_DW5i_VvC0CFEeaTKtr70A9XgYlAq5T95j7mivGO3lXVBTayU2PFNSdnLu8xtQEJJ7YrlmekEYyERmTSQmi7n2wZlmnG2puxVkegQKWNpdzOH4xCfgROnNCnAI
>>>>>> is responding slowly. 12,449 ms elapsed.
>>>>>>
>>>>>> We'll continue to get these warnings for rpc calls for many
>>>>>> different gadgets, the amount of time elapsed will grow, and
>>>>>> ultimately every gadget render slows to a crawl.
>>>>>>
>>>>>> Some other relevant information:
>>>>>> * We have implemented "throttling" logic in our own custom
>>>>>> HttpFetcher, which extends the BasicHttpFetcher. Basically, what
>>>>>> this does is keep track of how many outgoing requests are
>>>>>> happening for a given url, and if there are too many concurrent
>>>>>> ones going at once, it will start rejecting outgoing requests.
>>>>>> This was done to avoid a situation where an external service is
>>>>>> responding slowly and ties up all of shindig's external http
>>>>>> connections. In our case, I believe that because our rpc endpoint
>>>>>> is taking so long to respond, we start rejecting these requests
>>>>>> with our throttling logic.
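>>>>>>
>>>>>> In case the shape of it matters, it's roughly the following (a
>>>>>> simplified sketch, not our actual code: ours extends
>>>>>> BasicHttpFetcher rather than wrapping it, and the per-host limit
>>>>>> here is made up):
>>>>>>
>>>>>> import java.util.concurrent.ConcurrentHashMap;
>>>>>> import java.util.concurrent.ConcurrentMap;
>>>>>> import java.util.concurrent.Semaphore;
>>>>>>
>>>>>> import org.apache.shindig.gadgets.GadgetException;
>>>>>> import org.apache.shindig.gadgets.http.HttpFetcher;
>>>>>> import org.apache.shindig.gadgets.http.HttpRequest;
>>>>>> import org.apache.shindig.gadgets.http.HttpResponse;
>>>>>>
>>>>>> /** Fails fast once too many requests to one host are in flight. */
>>>>>> public class ThrottlingFetcher implements HttpFetcher {
>>>>>>   // Made-up cap; the real limit is configurable in our code.
>>>>>>   private static final int MAX_IN_FLIGHT_PER_HOST = 20;
>>>>>>
>>>>>>   private final HttpFetcher delegate; // the real BasicHttpFetcher
>>>>>>   private final ConcurrentMap<String, Semaphore> permits =
>>>>>>       new ConcurrentHashMap<String, Semaphore>();
>>>>>>
>>>>>>   public ThrottlingFetcher(HttpFetcher delegate) {
>>>>>>     this.delegate = delegate;
>>>>>>   }
>>>>>>
>>>>>>   public HttpResponse fetch(HttpRequest request) throws GadgetException {
>>>>>>     String host = request.getUri().getAuthority();
>>>>>>     permits.putIfAbsent(host, new Semaphore(MAX_IN_FLIGHT_PER_HOST));
>>>>>>     Semaphore s = permits.get(host);
>>>>>>     if (!s.tryAcquire()) {
>>>>>>       // Too many concurrent requests to this host already; reject
>>>>>>       // instead of tying up another outgoing connection.
>>>>>>       return HttpResponse.error();
>>>>>>     }
>>>>>>     try {
>>>>>>       return delegate.fetch(request);
>>>>>>     } finally {
>>>>>>       s.release();
>>>>>>     }
>>>>>>   }
>>>>>> }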
>>>>>>
>>>>>> I have tried to trace through the rpc calls inside the shindig
>>>>>> code, starting in the RpcServlet, and as best I can tell, these
>>>>>> rpc calls are used for:
>>>>>> * getting viewer data
>>>>>> * getting application data
>>>>>> * anything else?
>>>>>>
>>>>>> I've also looked at the BasicHTTPFetcher, but nothing stands out
>>>>>> to me at first glance that would cause such a difference in
>>>>>> performance between environments if, as our sys admins say, they
>>>>>> are the same.
>>>>>>
>>>>>> Additionally, I've ensured that the database table which contains
>>>>>> our Application Data is indexed properly (by person ID and gadget
>>>>>> url) and that person data is cached.
>>>>>>
>>>>>> Any other ideas, or areas in the codebase to explore, are very
>>>>>> much appreciated.
>>>>>>
>>>>>> Thanks!
>>>>>> -Matt
>>>>
>>