Yep, that's where I'm headed next. Understandably, there's some hesitation on our product owners' part to do that, so it takes a while to get to that point.
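For what it's worth, all I have in mind is a shim like the one below around
the suspect fetch calls (just a sketch; the helper name and the one-second
threshold are made up, not anything already in our code):

import java.util.logging.Logger;

import org.apache.shindig.gadgets.GadgetException;
import org.apache.shindig.gadgets.http.HttpFetcher;
import org.apache.shindig.gadgets.http.HttpRequest;
import org.apache.shindig.gadgets.http.HttpResponse;

/** Hypothetical timing shim: logs any fetch that takes over a second. */
final class TimedFetch {
  private static final Logger LOG =
      Logger.getLogger(TimedFetch.class.getName());

  static HttpResponse fetchWithTiming(HttpFetcher fetcher, HttpRequest request)
      throws GadgetException {
    long start = System.nanoTime();
    try {
      return fetcher.fetch(request);
    } finally {
      long elapsedMs = (System.nanoTime() - start) / 1000000L;
      if (elapsedMs > 1000) { // only bother logging the slow calls
        LOG.warning("fetch of " + request.getUri() + " took " + elapsedMs + "ms");
      }
    }
  }
}

The same pattern could go around our AppData lookups, so we can see which
leg of the rpc call is actually slow.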
Will let you know what I find. Thanks!
-Matt

On 7/18/14, 8:44 AM, "Ryan Baxter" <rbaxte...@gmail.com> wrote:

>Matt, I think further investigation is warranted. I really think you
>need to find a way to trace through the code and find where the
>slowdown is occurring. That will help us narrow down what the problem
>is. I know it is production, but getting some code on there that
>starts timing method calls and such can be very useful.
>
>On Tue, Jul 15, 2014 at 3:04 PM, Merrill, Matt <mmerr...@mitre.org> wrote:
>> Hi Ryan,
>>
>> Thanks for responding!
>>
>> I've attached our ehcacheConfig; comparing it to the default
>> configuration, the only differences are the overall number of elements
>> (10000 in ours vs. 1000 in the default) and the temp disk store
>> location.
>>
>> I assume you're asking whether each user in our system has the exact
>> same set of gadgets to render, correct? If that's the case: different
>> users have different sets of gadgets; however, many of them have a
>> default set we give them when they are initially set up in our system.
>> So, many people will hit the same gadgets over and over again. This
>> default subset is about 10-12 different gadgets, and that is by and
>> large what many users have.
>>
>> However, we have a total of 48 different gadgets that could be
>> rendered by a user at any given time on this instance of shindig. We
>> do run another instance of shindig which could render a different
>> subset of gadgets, but that has much lower usage and only renders
>> about 10 different gadgets altogether.
>>
>> I am admittedly rusty with my ehCache configuration knowledge, but
>> here are a couple of things I noticed (a rough fragment of the config
>> follows the list):
>> * I notice that the maxBytesLocalHeap in the ehCacheConfig is 50mb,
>> which seems low; however, this is the same setting we had in shindig
>> 2.0, so I have to wonder if that has anything to do with it.
>> * Our old ehCache configuration for shindig 2.0 specified a
>> defaultCache maxElementsInMemory of 1000 but NO sizeOfPolicy at all.
>> * Our new ehCache configuration for shindig 2.5 specifies a
>> sizeOfPolicy maxDepth of 10000 but NO defaultCache
>> maxElementsInMemory.
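>>
>> For reference, the relevant fragment of the 2.5 file looks roughly
>> like this (heavily trimmed, and partly from memory, so treat it as a
>> sketch rather than a paste of the real config):
>>
>> <ehcache maxBytesLocalHeap="50M" ...>
>>   <sizeOfPolicy maxDepth="10000"/>
>>   <defaultCache ... /> <!-- no maxElementsInMemory anymore -->
>>   ...
>> </ehcache>
>>
>> If I understand the ehCache docs right, byte-based sizing means
>> ehCache has to walk each cached object graph to compute its size, and
>> sizeOfPolicy's maxDepth caps how many objects a single walk will
>> visit, so a large maxDepth isn't free the way a plain element count
>> is.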
>>
>> Our heap size in Tomcat is 2048mb, which seems adequate given a 50mb
>> max heap for a cache. This is the same heap size from when we were
>> using shindig 2.0. Unfortunately, we don't have profiling tools
>> enabled on our Tomcat instances, so I can't see what the heap looked
>> like when things crashed, and like I said, we're unable to reproduce
>> this in int.
>>
>> I think we might be on to something here… I will keep searching, but
>> if any devs out there have any ideas, please let me know.
>>
>> Thanks shindig list!
>> -Matt
>>
>> On 7/13/14, 10:12 AM, "Ryan Baxter" <rbaxte...@gmail.com> wrote:
>>
>>>Matt, can you tell us more about how you have configured the caches
>>>in shindig? When you are rendering these gadgets, are you rendering
>>>the same gadget across all users?
>>>
>>>-Ryan
>>>
>>>> On Jul 9, 2014, at 3:31 PM, "Merrill, Matt" <mmerr...@mitre.org>
>>>> wrote:
>>>>
>>>> Stanton,
>>>>
>>>> Thanks for responding!
>>>>
>>>> This is one instance of shindig.
>>>>
>>>> If you mean the configuration within the container and for the
>>>> shindig java app, then yes, the locked domains are the same. In
>>>> fact, the configuration, with the exception of shindig's host URLs,
>>>> is exactly the same from what I can tell.
>>>>
>>>> Unfortunately, I don't have any way to trace that exact message,
>>>> but I did do a traceroute from the server running shindig to the
>>>> URL that is being called for rpc calls, to make sure there weren't
>>>> any extra network hops. There weren't; it actually had only one, as
>>>> expected for an app making an HTTP call to itself.
>>>>
>>>> Thanks again for responding.
>>>>
>>>> -Matt
>>>>
>>>>> On 7/9/14, 3:08 PM, "Stanton Sievers" <ssiev...@apache.org> wrote:
>>>>>
>>>>> Hi Matt,
>>>>>
>>>>> Is the configuration for locked domains and security tokens
>>>>> consistent between your test and production environments?
>>>>>
>>>>> Do you have any way of tracing the request in the log entry you
>>>>> provided through the network? Is this a single Shindig server, or
>>>>> is there any load balancing occurring?
>>>>>
>>>>> Regards,
>>>>> -Stanton
>>>>>
>>>>>
>>>>>> On Wed, Jul 9, 2014 at 2:40 PM, Merrill, Matt
>>>>>> <mmerr...@mitre.org> wrote:
>>>>>>
>>>>>> Hi shindig devs,
>>>>>>
>>>>>> We are in the process of upgrading from shindig 2.0 to
>>>>>> 2.5-update1, and everything has gone ok; however, once we got
>>>>>> into our production environment, we are seeing significant
>>>>>> slowdowns in the opensocial RPC calls that shindig makes to
>>>>>> itself when rendering a gadget.
>>>>>>
>>>>>> This is obviously very dependent on how we've implemented the
>>>>>> shindig interfaces in our own code, and also on our
>>>>>> infrastructure, but we're hoping someone on the list can give us
>>>>>> some more ideas for areas to investigate, inside shindig itself
>>>>>> or in general.
>>>>>>
>>>>>> Here's what's happening:
>>>>>> * Gadgets load fine when the app is not experiencing much load
>>>>>> (< 10 users rendering 10-12 gadgets on a page)
>>>>>> * Once a reasonable subset of users begins rendering gadgets,
>>>>>> gadget render calls through the "ifr" endpoint start taking a
>>>>>> very long time to respond
>>>>>> * The problem gets worse from there
>>>>>> * Even with extensive load testing, we can't recreate this
>>>>>> problem in our testing environments
>>>>>> * Our system administrators have assured us that the
>>>>>> configurations of our servers are the same between int and prod
>>>>>>
>>>>>> This is an example of what we're seeing from the logs inside
>>>>>> BasicHttpFetcher:
>>>>>>
>>>>>> http://238redacteddnsprefix234.gadgetsv2.company.com:7001/gmodules/rpc?st=mycontainer%3AvY2rb-teGXuk9HX8d6W0rm6wE6hkLxM95ByaSMQlV8RudwohiAFqAliywVwc5yQ8maFSwK7IEhogNVnoUXa-doA3_h7EbSDGq_DW5i_VvC0CFEeaTKtr70A9XgYlAq5T95j7mivGO3lXVBTayU2PFNSdnLu8xtQEJJ7YrlmekEYyERmTSQmi7n2wZlmnG2puxVkegQKWNpdzOH4xCfgROnNCnAI
>>>>>> is responding slowly. 12,449 ms elapsed.
>>>>>>
>>>>>> We'll continue to get these warnings for rpc calls for many
>>>>>> different gadgets, the amount of time elapsed will grow, and
>>>>>> ultimately every gadget render slows to a crawl.
>>>>>>
>>>>>> Some other relevant information:
>>>>>> * We have implemented "throttling" logic in our own custom
>>>>>> HttpFetcher, which extends the BasicHttpFetcher. Basically, what
>>>>>> this does is keep track of how many outgoing requests are
>>>>>> happening for a given url, and if there are too many concurrent
>>>>>> ones going at once, it will start rejecting outgoing requests.
>>>>>> This was done to avoid a situation where an external service is
>>>>>> responding slowly and ties up all of shindig's external http
>>>>>> connections. In our case, I believe that because our rpc endpoint
>>>>>> is taking so long to respond, we start rejecting these requests
>>>>>> with our throttling logic.
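>>>>>>
>>>>>> In case the shape of it matters, it's roughly the following (a
>>>>>> simplified sketch, not our actual code: ours extends
>>>>>> BasicHttpFetcher rather than wrapping it, and the per-host limit
>>>>>> here is made up):
>>>>>>
>>>>>> import java.util.concurrent.ConcurrentHashMap;
>>>>>> import java.util.concurrent.ConcurrentMap;
>>>>>> import java.util.concurrent.Semaphore;
>>>>>>
>>>>>> import org.apache.shindig.gadgets.GadgetException;
>>>>>> import org.apache.shindig.gadgets.http.HttpFetcher;
>>>>>> import org.apache.shindig.gadgets.http.HttpRequest;
>>>>>> import org.apache.shindig.gadgets.http.HttpResponse;
>>>>>>
>>>>>> /** Fails fast once too many requests to one host are in flight. */
>>>>>> public class ThrottlingFetcher implements HttpFetcher {
>>>>>>   // Made-up cap; the real limit is configurable in our code.
>>>>>>   private static final int MAX_IN_FLIGHT_PER_HOST = 20;
>>>>>>
>>>>>>   private final HttpFetcher delegate; // the real BasicHttpFetcher
>>>>>>   private final ConcurrentMap<String, Semaphore> permits =
>>>>>>       new ConcurrentHashMap<String, Semaphore>();
>>>>>>
>>>>>>   public ThrottlingFetcher(HttpFetcher delegate) {
>>>>>>     this.delegate = delegate;
>>>>>>   }
>>>>>>
>>>>>>   public HttpResponse fetch(HttpRequest request) throws GadgetException {
>>>>>>     String host = request.getUri().getAuthority();
>>>>>>     permits.putIfAbsent(host, new Semaphore(MAX_IN_FLIGHT_PER_HOST));
>>>>>>     Semaphore s = permits.get(host);
>>>>>>     if (!s.tryAcquire()) {
>>>>>>       // Too many concurrent requests to this host already; reject
>>>>>>       // instead of tying up another outgoing connection.
>>>>>>       return HttpResponse.error();
>>>>>>     }
>>>>>>     try {
>>>>>>       return delegate.fetch(request);
>>>>>>     } finally {
>>>>>>       s.release();
>>>>>>     }
>>>>>>   }
>>>>>> }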
>>>>>>
>>>>>> I have tried to trace through the rpc calls inside the shindig
>>>>>> code, starting in the RpcServlet, and as best I can tell, these
>>>>>> rpc calls are used for:
>>>>>> * getting viewer data
>>>>>> * getting application data
>>>>>> * anything else?
>>>>>>
>>>>>> I've also looked at the BasicHTTPFetcher, but nothing stands out
>>>>>> to me at first glance that would cause such a difference in
>>>>>> performance between environments if, as our sys admins say, they
>>>>>> are the same.
>>>>>>
>>>>>> Additionally, I've ensured that the database table which contains
>>>>>> our Application Data is indexed properly (by person ID and gadget
>>>>>> url) and that person data is cached.
>>>>>>
>>>>>> Any other ideas, or areas in the codebase to explore, are very
>>>>>> much appreciated.
>>>>>>
>>>>>> Thanks!
>>>>>> -Matt
>>>>
>>