On Wed, May 06, 2015 at 10:07:31PM -0500, Francis Ginther wrote:
> On Wed, May 6, 2015 at 8:11 PM, Thomi Richards
> <[email protected]> wrote:
>
> > Hi,
> >
> > On Thu, May 7, 2015 at 9:28 AM, Paul Larson <[email protected]>
> > wrote:
> >
> >> I think we knew it would be important, but we didn't yet know what
> >> needed to be measured.
> >
> > Indeed. I don't mean to suggest that I (or anyone) can name all the
> > stats we want to track for any new system we build - these things
> > will evolve and change over time. Let's try to capture that in an
> > acceptance criterion: it seems we're saying that whatever we deploy
> > must be able to pick up new or changed metrics easily. I'm thinking
> > along the lines of "I just re-deploy the service(s) with the new
> > stats enabled and it 'just works' - no need to configure multiple
> > things". So, as a first stab, how about something like:
> >
> > * The system must be able to respond to newly exposed metrics
> >   without requiring manual configuration.
> >
> > -or perhaps-
> >
> > * The system must be able to track additional metrics, or change
> >   existing metrics, without needing to change anything other than
> >   the service(s) affected.
>
> Thanks for starting to define the criteria.
>
> > ugh - writing these criteria is hard. Suggestions more than welcome
> > - I think my suggestions above missed the mark somewhat :D
> >
> >> I'm not sure we still know *everything* that needs measurement, but
> >> it's worth noting that many of the measurements that would have
> >> been really useful were not so much about continuing operations,
> >> but about comparisons between the old and new system. This should
> >> have been called out as a gap in the acceptance criteria. I do
> >> agree that some operational measurements in the future are useful
> >> too, but who monitors those? Do we alert on any of them?
> >
> > Agreed. How about:
> >
> > * The system must support alerting via PagerDuty.
> >
> > I didn't consider alerting before, but it's an interesting idea.
> >
> >> The big ones that come to mind could mostly be solved by the queue
> >> stats that Celso worked on.
> >
> > I'm not sure what that is?
> >
> >> Unless we are called to revamp the solution later down the road and
> >> need to understand the performance characteristics again, I think
> >> the bigger hole for the moment is alerting, and having a good place
> >> to send those alerts with a clear path on how to resolve them -
> >> e.g. a dead-letter queue flood, a big spike in queue depth, etc.
> >
> > Agreed. I think, though, that we want a system that can track these
> > things long term without any sort of cognitive overload. So we track
> > the performance metrics now (because we're handing the system over),
> > but we also continue to track those metrics forever: we want a
> > system that doesn't degrade as we track more and more things. We
> > NEVER want to be in a situation where we're saying "OK, let's stop
> > tracking X, Y, Z, to make room for these new metrics". As acceptance
> > criteria, how about something like:
> >
> > * The system must be able to track metrics without displaying them.
> > * The system must be able to track a large number of metrics without
> >   degrading performance (ugh - please help me re-write this).
> >
> >>> I'd love to get some more information on ELK plugins.
> >>> I don't have much experience with elasticsearch, and the little
> >>> bit I tried to do (backing up and restoring elasticsearch when we
> >>> migrated the ELK deployment to production) proved to be tricky.
> >>
> >> Unless we are collecting for a limited duration to analyze
> >> performance, I think we should avoid any requirement for
> >> long-running metrics. Then the monitoring becomes a critical
> >> production service in its own right - and I think unnecessarily so
> >> in this case.
>
> I have to disagree on part of this. I think it's important that we
> collect long-running statistics. This is going to be our tool for
> going to IS (or whoever) and requesting more resources. We'll need to
> be able to back up our request for 20 more BBBs with some data to show
> that it's going to meet the demand. We also need long-running
> statistics to establish a baseline. For example, right now it may take
> 2 minutes for uci-nova to set up the testbed; if we later notice this
> is taking 5 minutes, we should be better able to find the regression.
>
> I do agree that the statistics service should not be required to keep
> services running. Just like the logging solution, data should be
> thrown over the wall at it, and if it's not there, nothing should
> break.
>
> > hmmmm. I don't think there's anything wrong with gathering a metric
> > that we don't actively monitor. It's nice to be able to look back at
> > historical data when something goes wrong - to be able to answer the
> > question "How does this service's performance compare with last
> > week's?".
> >
> > I imagine we'll collect 20-30 metrics from most services, but
> > probably only actively monitor 2 or 3, *until something goes wrong*,
> > at which point having those additional numbers can be a real boon to
> > debugging / problem solving.
>
> That's what I expect as well. I can see us going overboard at first
> and measuring things that end up being meaningless, just like we have
> log messages that we realize are irrelevant over time. And that's
> fine. Measuring a meaningless thing should not have a noticeably
> adverse impact on the system, so that we can safely ignore these
> forever if necessary (you've already covered this as a criterion
> above).
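(As a concrete sketch of the "thrown over the wall" idea above: a
statsd-style client that ships metrics over UDP is fire-and-forget by
construction - if the collector is down, the packet is simply lost and
the service carries on. The host, port, and metric names below are made
up for illustration, not an agreed design.)

    import socket
    import time


    # A minimal fire-and-forget metrics client. UDP is connectionless,
    # so a dead collector never blocks or breaks the sending service.
    class FireAndForgetStats:

        def __init__(self, host="stats.internal", port=8125):
            # Hypothetical collector address - purely illustrative.
            self.addr = (host, port)
            self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

        def timing(self, name, millis):
            self._send("%s:%d|ms" % (name, millis))

        def gauge(self, name, value):
            self._send("%s:%d|g" % (name, value))

        def _send(self, payload):
            try:
                self.sock.sendto(payload.encode("ascii"), self.addr)
            except OSError:
                pass  # swallow even local errors: never break the service


    # e.g. record testbed setup time, so a regression from 2 minutes to
    # 5 minutes shows up against the historical baseline:
    stats = FireAndForgetStats()
    start = time.time()
    # ... set up the testbed here ...
    stats.timing("uci-nova.testbed_setup", int((time.time() - start) * 1000))

The "name:value|type" payloads follow the statsd line protocol, which is
one obvious fit for this requirement, but the same property holds for
anything UDP-based.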
One thing to consider is the cost of implementing measurements that are
meaningless. I think we are trying to get in an interesting and
potentially useful bit of engineering work that we don't *know* we need
yet. I think Thomi is spot on regarding the quota usage monitoring, but
I think we should *just* implement that bit and add further metrics as
we identify their utility. Remember: start small and build out from
there.

Thanks,
Joe
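(A sketch of what "just implement the quota bit" might look like,
reusing the fire-and-forget client sketched earlier in the thread.
get_quota_usage() is a hypothetical stand-in for whatever the real
source of quota data turns out to be - e.g. a nova API call - and the
metric names and interval are assumptions, not a design.)

    import time


    def get_quota_usage():
        # Hypothetical: return {resource: (used, limit)} from the cloud.
        return {"instances": (8, 10), "cores": (30, 40)}


    def report_quota(stats, interval=60):
        # Emit gauges, not counters: we want the current level on each
        # tick, which is exactly the baseline a "we need 20 more BBBs"
        # request to IS would be backed by.
        while True:
            for resource, (used, limit) in get_quota_usage().items():
                stats.gauge("quota.%s.used" % resource, used)
                stats.gauge("quota.%s.limit" % resource, limit)
            time.sleep(interval)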

