On Mon, May 4, 2015 at 3:49 PM, Thomi Richards wrote:

> I think we all understood the importance of logging data, but kind of
> dropped the ball on stats data. In the future, I think we should think of
> "logging & metrics" as being integral parts of developing a new system.
> Lesson learned!

I think we knew it would be important, but we didn't yet know what needed to be measured. I'm still not sure we know *everything* that needs measurement, but it's worth noting that many of the measurements that would have been really useful were not so much about continuing operations as about comparisons between the old and new system. This should have been called out as a gap in the acceptance criteria.

I do agree that some operational measurements would be useful in the future too, but who monitors those? Do we alert on any of them? The big ones that come to mind could mostly be covered by the queue stats that Celso worked on, I think, unless we are called on to revamp the solution later down the road and need to understand the performance characteristics again.

I think the bigger hole for the moment is alerting, and having a good place to send those alerts with a clear path on how to resolve them, e.g. a deadletter queue flood, a big spike in queue depth, etc.
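
As a rough illustration of the two alert conditions named above (a deadletter queue flood and a spike in queue depth), here is a minimal sketch in Python. Everything in it is hypothetical: the `QueueSample` shape, the `check_queue` function, and the threshold values are assumptions made for the example, not taken from the queue stats work or from any existing system.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class QueueSample:
    """One reading of a queue's stats (hypothetical shape)."""
    name: str
    depth: int                   # messages currently waiting in the queue
    is_deadletter: bool = False  # True for a deadletter queue


def check_queue(
    sample: QueueSample,
    previous_depth: Optional[int],
    deadletter_limit: int = 10,    # assumed threshold, tune per system
    spike_factor: float = 5.0,     # assumed: 5x growth between samples
    min_spike_depth: int = 100,    # ignore spikes on nearly empty queues
) -> List[str]:
    """Return human-readable alerts for one sample (empty list if healthy)."""
    alerts = []

    # Deadletter flood: any sizeable backlog here means messages are failing.
    if sample.is_deadletter and sample.depth > deadletter_limit:
        alerts.append(
            f"{sample.name}: deadletter flood, depth={sample.depth} "
            f"(limit {deadletter_limit})"
        )

    # Depth spike: compare against the previous reading, if there is one.
    if (
        not sample.is_deadletter
        and previous_depth is not None
        and sample.depth >= min_spike_depth
        and sample.depth > spike_factor * max(previous_depth, 1)
    ):
        alerts.append(
            f"{sample.name}: queue depth spike, "
            f"{previous_depth} -> {sample.depth}"
        )

    return alerts
```

Each returned string would still need a destination and a documented resolution step, which is the part the message above identifies as missing.
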
> I'd love to get some more information on ELK plugins. I don't have much
> experience with elasticsearch, and the little bit I tried to do (backing up
> and restoring elasticsearch when we migrated the elk deployment to
> production) proved to be tricky.

Unless we are collecting for a limited duration to analyze performance, I think we should avoid any requirement for long-running metrics. Otherwise the monitoring becomes a critical production service in its own right, and I think unnecessarily in this case.

Prometheus does look pretty cool at first glance, but I haven't looked at it in any depth yet. I think it's worth a spike to investigate its strengths and weaknesses vs. ELK to determine whether one or both fit our needs better. This could *certainly* be useful for future projects.

For existing ones, I will assume that retrofitting stats on them is a new story, and that it should be approached not from the idea of "how do we prove this is better than X" but "How do we know when there is a problem in the system, and ensure that we have the right data to know what's going on so someone can respond to it quickly?"
--
Mailing list: https://launchpad.net/~canonical-ci-engineering
Post to     : [email protected]
Unsubscribe : https://launchpad.net/~canonical-ci-engineering
More help   : https://help.launchpad.net/ListHelp

