The best part for me is I can go back and read this thread again if I need to. Yay, mailing lists.
Get off my lawn,
Joe

On Thu, May 7, 2015 at 3:39 PM, Francis Ginther <[email protected]> wrote:

> On Thu, May 7, 2015 at 11:02 AM, Ursula Junque <[email protected]> wrote:

>> Hi,

>> On Thu, May 7, 2015 at 12:14 PM, Celso Providelo <[email protected]> wrote:

>>> Hi guys,

>>> Let me include some extra information in this thread that might help us clear some obstacles to achieving better visibility into our systems.

>>> The pattern we have established for pushing rich events via python-logstash has been serving us very well, and I'd be wary of initiatives to retrofit working services with anything else without identifying exactly what we are missing with this approach, especially because we can leverage logstash capabilities to hub events conditionally to other systems: statsd/graphite, IRC and PagerDuty [1].

>>> There are undeniable visualisation limitations with kibana3, and that's the likely motivation for thinking about Graphite and Prometheus. Frankly speaking, though, they look pretty much the same in terms of the features they provide, and both would require extra infrastructure deployment (and maintenance) to provide better visualisation of data we already have (and can easily augment, if necessary). If the problem is indeed only visualisation, let's evaluate kibana4 [2], which would provide a much smoother migration path.

> I wasn't aware kibana4 had these capabilities, thanks for providing the insight. I find there are several aspects of kibana that I'm unaware of. I get that it is a data aggregator, but I feel I'm only scratching the surface of what I could be doing with it. I'm very interested in any service that lets me push gobs of blobs of data to it so that I can then pick and choose what I do with it later.

>>> Moreover, I feel that we are only scratching the surface of ELK in terms of its capability to provide answers to our current problems, and this idea that Graphite would give us free-(lunch-)metrics is not entirely true: the metrics still have to be built/modelled in the services and in Graphite, i.e. it's more about figuring out what we want to see than how to do it.

>> Just my two cents: I agree it's fundamental to understand the problems we're trying to solve before diving into solution details, and I believe that's the tiny missing bit in this discussion. That said, I think situations like this are the right moments to look into different technologies. For example: if the issue now is indeed visualization, we don't have to limit ourselves to kibana 4; we can use spikes to investigate it along with the alternatives we want to evaluate, like the ones suggested in this thread. Timeboxed efforts ftw. :)

> I think this thread is getting us to the point of understanding the problems, or at least bringing them to the surface. We had to start somewhere.

>>> Let's talk about the problems we are trying to solve with metrics ... From what we have already experienced and you have reported, we would like to 1) visualise *some* (not clear to me yet) performance/duration iterations and also 2) be alerted about misbehaving/malfunctioning units.

>>> First, let's agree that they are distinct problems.

> Yep.
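For concreteness, here is a minimal sketch of the rich-event pattern Celso describes (including the extra['duration'] field discussed further down), assuming python-logstash's TCP handler; the host, port, logger name and field names are illustrative placeholders, not our actual deployment values:

    import logging
    import logstash

    # Hypothetical worker logger; host/port stand in for the real LS endpoint.
    logger = logging.getLogger('uci-worker')
    logger.setLevel(logging.INFO)
    logger.addHandler(logstash.TCPLogstashHandler('logstash.internal', 5959, version=1))

    # Anything passed via `extra` becomes a field on the logstash event, so it can
    # later be filtered, graphed or routed in kibana/logstash without code changes.
    logger.info('image build finished', extra={
        'duration': 42.7,          # seconds; the extra['duration'] idea below
        'unit': 'cloud-worker/3',
        'exit_code': 0,
    })

Because the extra fields ride along on events the services already emit, augmenting them is cheap compared with standing up a separate metrics pipeline.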
>>> Performance visualisation on heterogeneous tasks is a complex problem regardless of the tool (kibana, graphite or prometheus). Even if we push individual step durations (extra['duration']) on events, I am struggling to see how we could make much sense of that data as a periodic series without being restricted to filtering individual sources (and even then it would be tied to the increase/shrink of tests). Anyway, it would be much cheaper to push extra data in the existing events and see how it could be combined/visualised in kibana; maybe that's the most efficient and useful experiment/spike we could run at this point.

> Indeed, monitoring something like average test time for packages going through proposed migration is meaningless to me as well, but using the same data to determine the percentage of time that a worker is busy is meaningful. That's the kind of data I would want to know before changing the scaling of a service. Perhaps solving this kind of problem is not a priority right now, that's ok. But when the time comes, I sure would like to be able to add that metric quickly if I didn't already have it.

>>> Alerting is something we are completely missing: we depend on someone to access kibana, interpret the graphs and act if needed, so problems go unnoticed all the time. I personally think this is a much more pressing issue to be tackled and, as pointed out above, it does not depend on any new infrastructure, just on extending the LS configuration (a sketch of what that could look like follows below).

>>> Despite the umbrella check-retry done in result-checker, I think we are interested in alerts for *all* ERROR events from units. We could get those via IRC, PD or email (I think we should decide which medium suits us best during the spike-story). This way the vanguard person would be alerted and could act upon any:

>>> * Spurious failure (e.g. glance client cached connection timeout -> should be fixed)

>>> * Unit failure; even if it was retried locally or by the result-checker (not visible to users), it is still a problem to be fixed in code (blocking new deployment promotion), or one of the myriad possible environment problems that could prevent a worker from delivering its results (more on this below)

>>> * Ultimately, a deadletter-ed message, which would be a problem visible to users

>>> How does that sound? First we become aware of the problems in an active way, then with that data we decide how we can proactively prevent them.

>>> This task looks small and objective enough to fit in a spike-story and would move us consistently forward on this subject.

> I really think alerting is a completely orthogonal topic. It's a good topic, but it does deserve its own spike and its own priority.
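To make the alerting idea concrete, here is a rough sketch of the kind of LS configuration extension Celso mentions, assuming the events carry the 'level' field that python-logstash sets and using the stock pagerduty and irc output plugins; the service key, IRC host and channel are placeholders:

    output {
      # Route every ERROR event from the units to the vanguard, regardless of source.
      if [level] == "ERROR" {
        pagerduty {
          service_key => "PLACEHOLDER-PD-SERVICE-KEY"
          description => "%{host}: %{message}"
        }
        irc {
          host     => "irc.example.com"
          channels => ["#ci-vanguard"]
          nick     => "ci-alerts"
        }
      }
    }

An email output could be added the same way once the preferred medium is decided during the spike.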
>>> Let me briefly give you my take on this wish of monitoring *everything* in the hope it will someday matter, like tenant quota usage or individual unit raw data (disk, cpu, mem, etc). This comes from the Newtonian root-cause-analysis mindset we were taught our entire lives [3], but that simplification becomes suboptimal for complex systems, where solving the 'root cause' often uncovers new problems with new effects not monitored before. In other words, monitoring all the possible cause-effect combinations becomes expensive because of their unpredictable relationships; we will never monitor enough to prevent problems.

> I get that it is futile to monitor everything. But at the same time I don't understand why it's pointless to monitor anything before there is a problem. We built these systems; we have intuition about where they are inefficient and where our customers will have complaints ("Why isn't it faster?"). In my head, there are already problems we are aware of, we just don't know to what degree they are a problem. The fact that they are distributed systems doesn't change that for me. This is my view of (1) above.

> If your argument is that we're better off investing effort in stopping/solving the underlying problems, which I consider topic (2), ... (read on)

>>> A practical example is the keypair leaking from uci-nova when the port quota is exhausted. While monitoring keypair quota looks useful for identifying that there is a leak, unfortunately it would not point us to the *real* cause of the problem; we would still need a human to interpret the results and decide how to sort it out, and meanwhile the problem would escalate and end up affecting service availability.

>>> For instance, instead of passively trying to collect isolated data and hoping a human would show up to sort it out quickly, we could simply kill/stop cloud-worker units that exited with exit_code 16. That would arguably decrease system throughput, but it would control/cease the damage without exposing unavailability to users while we analyse and solve the problem.

>>> This is just one example of how I think we should operate systems with this level of complexity: instead of trying to model complex and unpredictable cause-effect pairs, we buy time to perform deep analysis and work on fixes by isolating/removing problematic units ...

> ... I'm fully behind this idea. What is preventing us from implementing these circuit breakers? What do we need to finish around the external pieces of the problem to connect errors to alerts? Can we develop our next set of services to 'self-destruct' when they hit one of these errors? Do we need a spike story here to do anything, or are there some well understood improvements we could implement right away? I think this is fundamentally a distinct problem from "real-time stats monitoring", which is where this thread started. I think both are important, but I'm a little too worn out from writing this to think about which one is more important at the moment :-).

> Francis

> --
> Francis Ginther
> Canonical - Ubuntu Engineering - Continuous Integration Team
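As a strawman for the circuit breaker discussed above, here is a minimal sketch of a worker loop that takes itself out of rotation when a job exits with code 16, assuming that code signals the environment-level failure mentioned for uci-nova; the logger setup, exit-code constant and shutdown path are illustrative assumptions, not the real cloud-worker code:

    import logging
    import subprocess
    import sys

    import logstash

    # Placeholder LS endpoint, matching the event pattern sketched earlier in the thread.
    logger = logging.getLogger('cloud-worker')
    logger.setLevel(logging.INFO)
    logger.addHandler(logstash.TCPLogstashHandler('logstash.internal', 5959, version=1))

    ENVIRONMENT_FAILURE = 16  # assumed exit code for environment-level failures (per the thread)

    def run_job(job_cmd):
        """Run one job; trip the breaker instead of retrying on environment failures."""
        exit_code = subprocess.call(job_cmd)
        if exit_code == ENVIRONMENT_FAILURE:
            # Emit an ERROR event (so the alerting sketched above pages the vanguard)
            # and stop taking new work, containing the damage while a human investigates.
            logger.error('environment failure, taking this unit out of rotation',
                         extra={'exit_code': exit_code, 'job': ' '.join(job_cmd)})
            sys.exit(1)  # the init system / orchestration can then stop or replace the unit
        return exit_code

The point is not this particular shutdown mechanism but that the reaction is automatic: the unit stops doing harm first, and the deep analysis happens afterwards.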
--
Mailing list: https://launchpad.net/~canonical-ci-engineering
Post to     : [email protected]
Unsubscribe : https://launchpad.net/~canonical-ci-engineering
More help   : https://help.launchpad.net/ListHelp

