Thanks Roman. Adding Oved for Infra visibility. We have a lot to gain here.
On Thu, Oct 29, 2015 at 3:28 PM, Roman Mohr <[email protected]> wrote:
>
> On Fri, Oct 2, 2015 at 4:24 PM, Michal Skrivanek <[email protected]> wrote:
>>
>> On 2 Oct 2015, at 12:47, Roman Mohr wrote:
>>
>> Hi All,
>>
>> I have been contributing to the engine for three months now. While I
>> dug into the code I started to wonder how to visualize what the engine
>> is actually doing.
>>
>> This is one of the main problems with large applications; anything that
>> helps to understand what's going on is very welcome.
>>
>> To get better insights I added hystrix[1] to the engine. Hystrix is a
>> circuit breaker library which was developed by Netflix and has one
>> pretty interesting feature: real-time metrics for commands.
>>
>> In combination with hystrix-dashboard[2] it allows very interesting
>> insights. You can easily get an overview of the commands involved in
>> operations, their performance and complexity. Look at [2] and the
>> attachments in [5] and [6] for screenshots to get an impression.
>>
>> I want to propose integrating hystrix permanently, because from my
>> perspective the results were really useful, and I have also had good
>> experiences with hystrix in past projects.
>>
>> A first implementation can be found on gerrit[3].
>>
>> # Where is it immediately useful?
>>
>> During development and QA.
>>
>> An example: I tested the hystrix integration on the /api/vms and
>> /api/hosts REST endpoints and immediately saw that the number of
>> command executions grew linearly with the number of VMs and hosts. The
>> bug reports [5] and [6] are the result.
>>
>> # How to monitor the engine?
>>
>> It is as easy as starting a hystrix-dashboard [2] with
>>
>> $ git clone https://github.com/Netflix/Hystrix.git
>> $ cd Hystrix/hystrix-dashboard
>> $ ../gradlew jettyRun
>>
>> and pointing the dashboard to
>>
>> https://<customer.engine.ip>/ovirt-engine/hystrix.stream.
>>
>> # Other possible benefits?
>>
>> * Live metrics at customer site for admins, consultants and support.
>>
>> * Historical metrics for analysis in addition to the log files.
>>   The metrics information is directly usable in graphite [7].
>>   Therefore it would be possible to collect the json stream for a
>>   certain time period and analyze it later, like in [4]. To do that,
>>   someone just has to run
>>
>>   curl --user admin@internal:engine http://localhost:8080/ovirt-engine/api/hystrix.stream > hystrix.stream
>>
>>   for as long as necessary. The results can be analyzed later.
>>
>> +1
>> It's a great idea, and when properly documented so that even a BFU can
>> do it, it would allow us to get a much better idea of what is not
>> working, or working too slowly, on a system we don't have access to but
>> where the issue is reproducible. We could just ask: "hey, run this
>> thingie while you are reproducing the issue and send us the result".
>>
>> # Possible architectural benefits?
>>
>> In addition to the live metrics we might also have use for the real
>> hystrix features:
>>
>> * Circuit breaker
>> * Bulk execution of commands
>> * De-duplication of commands (caching)
>> * Synchronous and asynchronous execution support
>> * ...
>>
>> Our commands already have a lot of features, so I don't think there are
>> any quick wins, but maybe there are interesting opportunities for
>> infra.
>>
>> Eh... I would worry about that much later. First we should understand
>> what we are actually doing and why (as we all know, the engine is
>> likely doing a lot of useless stuff ;-)
>>
>> # Overhead?
>>
>> In the Hystrix FAQ [5] the Netflix engineers describe their results
>> regarding the overhead of wrapping every command into a new instance of
>> a hystrix command.
>>
>> They ran their tests on a standard 4-core Amazon EC2 server with a load
>> of 60 requests per second.
>>
>> When using threadpools they measured a mean overhead of less than one
>> millisecond (so negligible).
>> At the 90th percentile they measured an overhead of 3 ms, and at the
>> 99th percentile about 9 ms.
>>
>> This is likely good enough for backend commands and REST entry points
>> (as you currently did), but it may need more careful examination if we
>> wanted to add this to e.g. thread pool allocations. Don't get slowed
>> down by that, though; even for higher-level stuff it is a great source
>> of information.
>>
>> When configuring the hystrix commands to use semaphores instead of
>> threadpools, they are even faster.
>>
>> # How to integrate?
>>
>> A working implementation can be found on gerrit[3]. These patch sets
>> wrap a hystrix command around every VdcAction, every VdcQuery and every
>> VDSCommand. This required just four small modifications in the code
>> base.
>>
>> # Security?
>>
>> In the provided patches the hystrix-metrics-servlet is accessible at
>> /ovirt-engine/api/hystrix.stream. It is protected by basic auth but
>> accessible to everyone who can authenticate. We should probably
>> restrict it to admins.
>>
>> That would be great if it doesn't require too much work. If it does,
>> then we can start with enabling/disabling via JMX using Roy's recent
>> patch [8].
>>
> The hystrix stream is now accessible at
> http://<host>/ovirt-engine/services/hystrix.stream
> and admin privileges are needed. Further, it can be enabled and
> disabled via JMX (disabled by default).
> @Juan, @Roy thank you for your feedback on the code.
>
>> # Todo?
>>
>> 1) We report failed actions with return values, while Hystrix expects
>> failing commands to throw an exception, so on the dashboard almost
>> every command looks like a success. To overcome this, it would be
>> pretty easy to throw an exception inside the command and catch it
>> immediately after it leaves the hystrix wrapper.
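The throw-and-catch idea above can be sketched in plain Java, without the Hystrix dependency. All names here (ActionResult, runWrapped) are hypothetical stand-ins for the engine's return-value objects, and a simple counter stands in for Hystrix's failure metrics; in the real integration the exception would escape the HystrixCommand so that Hystrix records the failure, and be caught right outside the wrapper:

```java
import java.util.function.Supplier;

public class FailureSurfacingWrapper {

    // Hypothetical stand-in for the engine's return-value object,
    // which signals failure via a flag rather than an exception.
    public static class ActionResult {
        public final boolean succeeded;
        public ActionResult(boolean succeeded) { this.succeeded = succeeded; }
    }

    // Internal exception used only to carry the failed result
    // through the metrics-recording layer.
    static class ActionFailedException extends RuntimeException {
        final ActionResult result;
        ActionFailedException(ActionResult result) { this.result = result; }
    }

    // Stand-in for Hystrix's failure metrics: counts thrown exceptions.
    private int failureCount = 0;

    public ActionResult runWrapped(Supplier<ActionResult> action) {
        try {
            ActionResult result = action.get();
            if (!result.succeeded) {
                // Surface the failure as an exception so the metrics
                // layer sees it as an error, not a success.
                throw new ActionFailedException(result);
            }
            return result;
        } catch (ActionFailedException e) {
            failureCount++;      // recorded as a command failure
            return e.result;     // restore the return-value contract for callers
        }
    }

    public int getFailureCount() { return failureCount; }
}
```

Because the exception is caught immediately after the wrapper, callers still receive the usual return value and no existing code needs to change; the translation can live entirely in the command executor, as suggested below.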
>>
>> At the beginning it's probably enough to see what stuff is getting
>> called, without differentiating between success and failure (we mostly
>> do log failures, so hopefully we know when stuff is broken this way).
>>
> Ok, I'll leave it disabled for now. But it should really be as easy as
> throwing an exception if the command fails and catching it immediately
> afterwards (not the nicest-looking code, but it would work). And this
> can be encapsulated in the command executor, so it would not pollute
> the existing code.
>
>> 2) Finetuning
>> Do we want semaphores or a thread pool? If a thread pool, what size do
>> we want?
>>
> To answer this myself: I use semaphores, to be sure to support
> transactions over multiple commands properly.
>
>> 3) Three unpackaged dependencies: archaius, hystrix-core,
>> hystrix-contrib
>>
>> Since you yesterday volunteered to package them, I think this should
>> not stop us! :-)
>>
>> Thanks a lot for the effort; I have missed a proper analysis for soooo
>> long. Thanks for stepping up!
>>
>> michal
>>
>> # References
>>
>> [1] https://github.com/Netflix/Hystrix
>> [2] https://github.com/Netflix/Hystrix/tree/master/hystrix-dashboard
>> [3] https://gerrit.ovirt.org/#/q/topic:hystrix
>> [4] http://www.nurkiewicz.com/2015/02/storing-months-of-historical-metrics.html
>> [5] https://github.com/Netflix/Hystrix/wiki/FAQ#what-is-the-processing-overhead-of-using-hystrix
>> [5] https://bugzilla.redhat.com/show_bug.cgi?id=1268216
>> [6] https://bugzilla.redhat.com/show_bug.cgi?id=1268224
>> [7] http://graphite.wikidot.com
>> [8] https://gerrit.ovirt.org/#/c/29693/
>>
>> _______________________________________________
>> Devel mailing list
>> [email protected]
>> http://lists.ovirt.org/mailman/listinfo/devel
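The semaphore isolation chosen in (2) could be configured roughly as below. This is a configuration sketch against the Hystrix command setter API; the group key name and the concurrency limit are only example values, not what the gerrit patches actually use:

```java
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;
import com.netflix.hystrix.HystrixCommandProperties;

// Example command using semaphore isolation instead of a thread pool.
public class SemaphoreIsolatedCommand extends HystrixCommand<String> {

    public SemaphoreIsolatedCommand() {
        super(Setter
            .withGroupKey(HystrixCommandGroupKey.Factory.asKey("EngineActions"))
            .andCommandPropertiesDefaults(HystrixCommandProperties.Setter()
                .withExecutionIsolationStrategy(
                    HystrixCommandProperties.ExecutionIsolationStrategy.SEMAPHORE)
                // upper bound on concurrent executions; example value
                .withExecutionIsolationSemaphoreMaxConcurrentRequests(100)));
    }

    @Override
    protected String run() {
        // With semaphore isolation this runs on the calling thread, so
        // thread-bound state such as transactions stays intact across
        // multiple wrapped commands.
        return "result";
    }
}
```

Running on the calling thread is exactly why semaphores fit the transaction requirement mentioned above, at the cost of losing the timeout and thread-level bulkheading that the thread pool strategy provides.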
