The best part for me is I can go back and read this thread again if I need to. Yay, mailing lists.
Get off my lawn,
Joe

On Thu, May 7, 2015 at 3:39 PM, Francis Ginther <[email protected]> wrote:

> On Thu, May 7, 2015 at 11:02 AM, Ursula Junque <[email protected]> wrote:

>> Hi,

>> On Thu, May 7, 2015 at 12:14 PM, Celso Providelo <[email protected]> wrote:

>>> Hi guys,

>>> Let me include some extra information in this thread that might help us clear some obstacles to achieving better visibility into our systems.

>>> The pattern we have established for pushing rich events via python-logstash has been serving us very well, and I'd be wary of initiatives to retrofit working services with anything else without identifying exactly what we are missing with this approach, especially because we can leverage logstash capabilities to hub events conditionally to other systems: statsd/graphite, IRC and PagerDuty [1].

>>> There are undeniable visualisation limitations with kibana3, and that's the likely motivation for thinking about Graphite and Prometheus. Frankly speaking, though, they look pretty much the same in terms of the features they provide, and both would require extra infrastructure deployment (and maintenance) to provide better visualisation of data we already have (and can easily augment, if necessary). If the problem is indeed only visualisation, let's evaluate kibana4 [2], which would provide a much smoother migration path.

> I wasn't aware kibana4 had these capabilities, thanks for providing the insight. I find there are several aspects of kibana that I'm unaware of. I get that it is a data aggregator, but I feel I'm only scratching the surface of what I could be doing with it. I'm very interested in any service that lets me push gobs of blobs of data to it so that I can then pick and choose what I do with it later.

>>> Moreover, I feel that we are only scratching the surface of ELK in terms of its capability to provide answers to our current problems, and this idea that Graphite would give us free-(lunch-)metrics is not entirely true: the metrics still have to be built/modelled in the services and in Graphite, i.e. it's more about figuring out what we want to see than how to do it.

>> Just my two cents: I agree it's fundamental to understand the problems we're trying to solve before diving into solution details, and I believe that's the tiny missing bit in this discussion. That said, I think situations like this are the right moments to look into different technologies. For example: if the issue now is indeed visualization, we don't have to limit ourselves to kibana 4; we can use spikes to investigate it along with the alternatives we want to evaluate, like the ones suggested in this thread. Timeboxed efforts ftw. :)

> I think this thread is getting us to the point of understanding the problems, or at least bringing them to the surface. We had to start somewhere.

>>> Let's talk about the problems we are trying to solve with metrics ... From what we have already experienced and you have reported, we would like to 1) visualise *some* (not clear to me yet) performance/duration iterations and also 2) be alerted about misbehaving/malfunctioning units.

>>> First, let's agree that they are distinct problems.

> Yep.
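For concreteness, here is a minimal sketch of the rich-event pattern Celso describes (including the extra['duration'] field discussed further down), assuming python-logstash's TCP handler; the host, port, logger name and field names are illustrative placeholders, not our actual deployment values:

    import logging
    import logstash

    # Hypothetical worker logger; host/port stand in for the real LS endpoint.
    logger = logging.getLogger('uci-worker')
    logger.setLevel(logging.INFO)
    logger.addHandler(logstash.TCPLogstashHandler('logstash.internal', 5959, version=1))

    # Anything passed via `extra` becomes a field on the logstash event, so it can
    # later be filtered, graphed or routed in kibana/logstash without code changes.
    logger.info('image build finished', extra={
        'duration': 42.7,          # seconds; the extra['duration'] idea below
        'unit': 'cloud-worker/3',
        'exit_code': 0,
    })

Because the extra fields ride along on events the services already emit, augmenting them is cheap compared with standing up a separate metrics pipeline.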
>>> Performance visualisation on heterogeneous tasks is a complex problem regardless of the tool (kibana, graphite or prometheus). Even if we push individual step durations (extra['duration']) on events, I am struggling to see how we could make much sense of that data as a periodic series without being restricted to filtering individual sources (and even then it would be tied to the increase/shrink of tests). Anyway, it would be much cheaper to push extra data in the existing events and see how it could be combined/visualised in kibana; maybe that's the most efficient and useful experiment/spike we could run at this point.

> Indeed, monitoring something like average test time for packages going through proposed migration is meaningless to me as well, but using the same data to determine the percentage of time that a worker is busy is meaningful. That's the kind of data I would want to know before changing the scaling of a service. Perhaps solving this kind of problem is not a priority right now, that's ok. But when the time comes, I sure would like to be able to add that metric quickly if I didn't already have it.

>>> Alerting is something we are completely missing: we depend on someone to access kibana, interpret the graphs and act if needed, so problems go unnoticed all the time. I personally think this is a much more pressing issue to be tackled and, as pointed out above, it does not depend on any new infrastructure, just on extending the LS configuration (a sketch of what that could look like follows below).

>>> Despite the umbrella check-retry done in result-checker, I think we are interested in alerts for *all* ERROR events from units. We could get those via IRC, PD or email (I think we should decide which medium suits us best during the spike-story). This way the vanguard person would be alerted and could act upon any:

>>> * Spurious failure (e.g. glance client cached connection timeout -> should be fixed)

>>> * Unit failure; even if it was retried locally or by the result-checker (not visible to users), it is still a problem to be fixed in code (blocking new deployment promotion), or one of the myriad possible environment problems that could prevent a worker from delivering its results (more on this below)

>>> * Ultimately, a deadletter-ed message, which would be a problem visible to users

>>> How does that sound? First we become aware of the problems in an active way, then with that data we decide how we can proactively prevent them.

>>> This task looks small and objective enough to fit in a spike-story and would move us consistently forward on this subject.

> I really think alerting is a completely orthogonal topic. It's a good topic, but it does deserve its own spike and its own priority.
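To make the alerting idea concrete, here is a rough sketch of the kind of LS configuration extension Celso mentions, assuming the events carry the 'level' field that python-logstash sets and using the stock pagerduty and irc output plugins; the service key, IRC host and channel are placeholders:

    output {
      # Route every ERROR event from the units to the vanguard, regardless of source.
      if [level] == "ERROR" {
        pagerduty {
          service_key => "PLACEHOLDER-PD-SERVICE-KEY"
          description => "%{host}: %{message}"
        }
        irc {
          host     => "irc.example.com"
          channels => ["#ci-vanguard"]
          nick     => "ci-alerts"
        }
      }
    }

An email output could be added the same way once the preferred medium is decided during the spike.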
>>> Let me briefly give you my take on this wish of monitoring *everything* in the hope it will someday matter, like tenant quota usage or individual unit raw data (disk, cpu, mem, etc). This comes from the Newtonian root-cause-analysis mindset we were taught our entire lives [3], but that simplification becomes suboptimal for complex systems, where solving the 'root cause' often uncovers new problems with new effects not monitored before. In other words, monitoring all the possible cause-effect combinations becomes expensive because of their unpredictable relationships; we will never monitor enough to prevent problems.

> I get that it is futile to monitor everything. But at the same time I don't understand why it's pointless to monitor anything before there is a problem. We built these systems; we have intuition about where they are inefficient and where our customers will have complaints ("Why isn't it faster?"). In my head, there are already problems we are aware of, we just don't know to what degree they are a problem. The fact that they are distributed systems doesn't change that for me. This is my view of (1) above.

> If your argument is that we're better off investing effort in stopping/solving the underlying problems, which I consider topic (2), ... (read on)

>>> A practical example is the keypair leaking from uci-nova when the port quota is exhausted. While monitoring keypair quota looks useful for identifying that there is a leak, unfortunately it would not point us to the *real* cause of the problem; we would still need a human to interpret the results and decide how to sort it out, and meanwhile the problem would escalate and end up affecting service availability.

>>> For instance, instead of passively trying to collect isolated data and hoping a human would show up to sort it out quickly, we could simply kill/stop cloud-worker units that exited with exit_code 16. That would arguably decrease system throughput, but it would control/cease the damage without exposing unavailability to users while we analyse and solve the problem.

>>> This is just one example of how I think we should operate systems with this level of complexity: instead of trying to model complex and unpredictable cause-effect pairs, we buy time to perform deep analysis and work on fixes by isolating/removing problematic units ...

> ... I'm fully behind this idea. What is preventing us from implementing these circuit breakers? What do we need to finish around the external pieces of the problem to connect errors to alerts? Can we develop our next set of services to 'self-destruct' when they hit one of these errors? Do we need a spike story here to do anything, or are there some well understood improvements we could implement right away? I think this is fundamentally a distinct problem from "real-time stats monitoring", which is where this thread started. I think both are important, but I'm a little too worn out from writing this to think about which one is more important at the moment :-).

> Francis

> --
> Francis Ginther
> Canonical - Ubuntu Engineering - Continuous Integration Team
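As a strawman for the circuit breaker discussed above, here is a minimal sketch of a worker loop that takes itself out of rotation when a job exits with code 16, assuming that code signals the environment-level failure mentioned for uci-nova; the logger setup, exit-code constant and shutdown path are illustrative assumptions, not the real cloud-worker code:

    import logging
    import subprocess
    import sys

    import logstash

    # Placeholder LS endpoint, matching the event pattern sketched earlier in the thread.
    logger = logging.getLogger('cloud-worker')
    logger.setLevel(logging.INFO)
    logger.addHandler(logstash.TCPLogstashHandler('logstash.internal', 5959, version=1))

    ENVIRONMENT_FAILURE = 16  # assumed exit code for environment-level failures (per the thread)

    def run_job(job_cmd):
        """Run one job; trip the breaker instead of retrying on environment failures."""
        exit_code = subprocess.call(job_cmd)
        if exit_code == ENVIRONMENT_FAILURE:
            # Emit an ERROR event (so the alerting sketched above pages the vanguard)
            # and stop taking new work, containing the damage while a human investigates.
            logger.error('environment failure, taking this unit out of rotation',
                         extra={'exit_code': exit_code, 'job': ' '.join(job_cmd)})
            sys.exit(1)  # the init system / orchestration can then stop or replace the unit
        return exit_code

The point is not this particular shutdown mechanism but that the reaction is automatic: the unit stops doing harm first, and the deep analysis happens afterwards.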
--
Mailing list: https://launchpad.net/~canonical-ci-engineering
Post to     : [email protected]
Unsubscribe : https://launchpad.net/~canonical-ci-engineering
More help   : https://help.launchpad.net/ListHelp

