On 7 May 2015 at 16:19, Thomi Richards <[email protected]> wrote:
>>> On Thu, May 7, 2015 at 12:14 PM, Celso Providelo
>>> <[email protected]> wrote:
>>>> Let me briefly give you my take on this wish of monitoring *everything*
>>>> hoping they will someday matter, like tenant quota usage, individual unit
>>>> raw-data (disk, cpu, mem, etc). This comes from the Newtonian Root-Cause
>>>> Analysis mindset we were taught our entire life [3], but this
>>>> simplification becomes suboptimal for complex systems, where solving the
>>>> 'root-cause' often uncover new problems with new effects not monitored
>>>> before, i.e. monitoring all the possible cause-effect combinations
>>>> becomes expensive because of their unpredictable relationships, we will
>>>> never monitor enough to prevent problems.
...
> I get that root-cause-analysis for complex systems is hard, but
> "cause-analysis" can be easy. In my example, the symptom was "OMG we're
> running out of keypairs". The cause is "we're leaking keypairs". Who cares
> about the root cause? There is a problem (running out of keypairs), and if
> we don't fix it our system will break, and will break _hard_. Let's track
> these things which we know, if they ever happen, will break our services.
>
> We'll never be smart enough to predict all future sources of failure, but we
> know from experience that there's a class of failure we can monitor and
> alert on.

The problem is not that root cause analysis is hard in complex systems.
The problem is that there is no root cause of failure in complex systems.
Newton's Third Law only works when you can isolate down to singular inputs,
but that is necessarily not what we're dealing with when watching a process
influenced by its environment (think of the Butterfly Effect). It is
non-deterministic.

We are only going to create false confidence and waste time if we take the
position that the more holes we plug, the safer we will be.

There's a myriad of ways that work can end up dead-lettered. Work in such a
holding pattern can be investigated by a human (at their leisure) and thrown
back onto the queue when the problem is fixed, all without the end user ever
noticing. Why bother to isolate and report on each individual variant of
these when we will always have to go to the queue and inspect the messages?

--
Mailing list: https://launchpad.net/~canonical-ci-engineering
Post to     : [email protected]
Unsubscribe : https://launchpad.net/~canonical-ci-engineering
More help   : https://help.launchpad.net/ListHelp

