Re: [Canonical-ci-engineering] proposal for next sprint

Paul Larson Wed, 06 May 2015 14:29:22 -0700

On Mon, May 4, 2015 at 3:49 PM, Thomi Richards wrote:
>
>
> I think we all understood the importance of logging data, but kind of
> dropped the ball on stats data. In the future, I think we should think of
> "logging & metrics" as being integral parts of developing a new system.
> Lesson learned!
>
I think we knew it would be important, but we didn't yet know what needed
to be measured. I'm not sure we know *everything* that needs measuring even
now, but it's worth noting that many of the measurements that would have
been really useful were not so much about ongoing operations as about
comparisons between the old and new systems. This should have been called
out as a gap in the acceptance criteria. I do agree that some operational
measurements will be useful in the future too, but who monitors those? Do
we alert on any of them? I think the big ones that come to mind could
mostly be covered by the queue stats that Celso worked on, unless we are
called on to revamp the solution later down the road and need to
understand the performance characteristics again. I think the bigger hole
for the moment is alerting, and having a good place to send those alerts
with a clear path for resolving them, e.g. a deadletter queue flood, a big
spike in queue depth, etc.
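
To make the alerting idea a bit more concrete, here is a rough sketch in
Python of the two checks mentioned above. Everything in it is an assumption
for illustration: the function name, the thresholds, and the idea that we
have a list of recent depth samples per queue are all made up, and none of
it reflects what the queue stats work actually exposes.

```python
from statistics import mean

# All thresholds and names below are hypothetical, for illustration only.
DEADLETTER_LIMIT = 10   # deadletter messages tolerated before alerting
SPIKE_FACTOR = 3.0      # how far above the recent average counts as a spike
MIN_SPIKE_DEPTH = 50    # ignore "spikes" on queues that are nearly empty


def check_queue(name, depth_history, deadletter_count):
    """Return a list of alert messages for one queue.

    depth_history holds recent queue depth samples, oldest first;
    the last sample is treated as the current depth.
    """
    alerts = []

    # Deadletter flood: too many messages have been rejected or expired.
    if deadletter_count > DEADLETTER_LIMIT:
        alerts.append(
            f"{name}: deadletter flood ({deadletter_count} messages)"
        )

    # Depth spike: current depth is well above the average of earlier samples.
    if len(depth_history) >= 2:
        current = depth_history[-1]
        baseline = mean(depth_history[:-1])
        if current >= MIN_SPIKE_DEPTH and current > SPIKE_FACTOR * max(baseline, 1):
            alerts.append(
                f"{name}: queue depth spike ({current} vs. average {baseline:.0f})"
            )

    return alerts
```

Something like check_queue("some-queue", [10, 12, 11, 200], 0) would flag a
depth spike. The harder part is still deciding where those messages go and
who acts on them.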


>
> I'd love to get some more information on ELK plugins. I don't have much
> experience with elasticsearch, and the little bit I tried to do (backing up
> and restoring elasticsearch when we migrated the elk deployment to
> production) proved to be tricky.
>
Unless we are collecting for a limited duration to analyze performance, I
think we should avoid any requirement for long-running metrics. Otherwise
the monitoring becomes a critical production service in its own right, and
I think unnecessarily so in this case.

Prometheus does look pretty cool at first glance, but I haven't looked at
it in any depth yet. I think it's worth a spike to investigate its
strengths and weaknesses vs. ELK, to determine if one or both fit our needs
better. This could *certainly* be useful for future projects. For existing
ones, I will assume that retrofitting stats on them is a new story, and
should be approached not from the idea of "how do we prove this is better
than X" but "How do we know when there is a problem in the system, and
ensure that we have the right data to know what's going on so someone can
respond to it quickly?"

-- 
Mailing list: https://launchpad.net/~canonical-ci-engineering
Post to     : [email protected]
Unsubscribe : https://launchpad.net/~canonical-ci-engineering
More help   : https://help.launchpad.net/ListHelp
