Hi Robert,

Thanks for your analysis. Could you perhaps take some time to add your
recommendations to the document?
Some more comments below.

On 11-10-23 06:02 PM, Robert Collins wrote:
> Thanks for putting so much time into this, Francis! I'm *very* glad
> we took the time to understand the issue rather than addressing the
> symptoms. Something I'd really like to see is a reduction in the
> overall number: I think there is an impact in having so many zomg
> bugs (and we do have that many bugs: ignore the labels, the issue is
> real). As I read the first pivot table, 1/3 of the filed bugs do not
> get fixed, and about 2/3rds of that is legacy bugs.
>
> So what this says is '77% of the *increase* in criticals is due to
> long-standing existing defects': it's tech debt that we're *finally*
> paying off. If we were closing fewer of the legacy criticals, our
> increase would be substantially higher.
>
> So while I'm *totally* behind a renewed focus on performance -
> totally totally totally - I wonder if perhaps performance and legacy
> bugs are similar in that they are both debt we're paying for now -
> schemas, code structure (lazy evaluation), incomplete
> implementations, etc. Performance bugs perhaps want more schema
> changes, but equally other correctness bugs need schema work too.

Yes, I agree with this analysis. I suggested focusing on performance
for two reasons:

1) insufficient_scaling is the individual category with the most bugs
   falling under it (24%; the next one is missing_integration_test at
   14%);
2) performance bugs are very easy to identify.

But I agree that any work spent on difficult areas (badly factored
code, spotty test coverage, etc.) is probably worthwhile, as it will
pre-emptively remove a bunch of bugs that would otherwise meet our
Critical criteria and need fixing. It's just that performance problems
are very easy to spot, and we have several well-known patterns for
addressing them (one of them is sketched further down).

> maintenance+support squads together are paying 14/29=48% of the
> tech-debt listed as 'legacy', and doing that is taking 14/22=63% of
> their combined output. To stay on top of the legacy critical bug
> source then, we need a 100% increase in the legacy fix rate, and
> that isn't available from the existing maintenance squads no matter
> whether we ask them to drop other sources of criticals or not. If we
> did not have maintenance-added criticals (6 items) and that
> translated 1:1 into legacy fixes, we'd still be short 9 legacy
> bugfixes to keep the legacy component flat.
>
> So this says to me, we are really mining things we didn't do well
> enough in the past, and it takes long enough to fix each one, that
> until we hit the bottom of the mine, it's going to be a standing
> feature for us.

Yes, I agree with that characterisation (the arithmetic checks out;
there's a worked version further down). But I would be hard-pressed to
change the ratio between feature vs maintenance work. While addressing
tech-debt is important for the growth of the project, we also need to
make changes to ensure that the project stays relevant in the evolving
landscape.

> I agree with the recommendations to spend some more effort on the
> safety nets of testing; the decreased use of doctests and increased
> use of unit tests should aid with maintenance overhead, and avoiding
> known problems is generally a positive thing. The SOA initiative
> will also help us decouple things as we go, which should help with
> maintainability and reaction times.

Again, I agree. I'd really like TDD to be used as standard, but that's
very hard to "enforce" in a distributed environment.
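To illustrate the "well-known patterns" I mentioned above, here is the
most common one in rough form: replacing per-row lazy lookups with one
bulk query, so the query count stays flat as the dataset grows. This
is a generic sketch against an in-memory sqlite3 table; the schema and
names are made up for the illustration, not taken from our code.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE email (person_id INTEGER, address TEXT)")
    conn.executemany(
        "INSERT INTO email VALUES (?, ?)",
        [(i, "user%d@example.com" % i) for i in range(1000)])

    def emails_slow(person_ids):
        # N+1 shape: one query per person. Fine with 10 rows on a dev
        # box, a timeout candidate with 10,000 rows in production.
        return [conn.execute(
                    "SELECT address FROM email WHERE person_id = ?",
                    (pid,)).fetchone()[0]
                for pid in person_ids]

    def emails_fast(person_ids):
        # Bulk shape: a single query for the whole set, so the cost no
        # longer grows with the number of rows shown on the page.
        marks = ", ".join("?" * len(person_ids))
        rows = conn.execute(
            "SELECT person_id, address FROM email"
            " WHERE person_id IN (%s)" % marks, person_ids).fetchall()
        by_id = dict(rows)
        return [by_id[pid] for pid in person_ids]

    assert emails_slow(list(range(50))) == emails_fast(list(range(50)))

The nice thing about this class of bug is exactly what I said above:
it is mechanical to spot (query counts that scale with result size)
and mechanical to fix.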
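And here is the worked version of the squad arithmetic I referred to,
as a quick sanity check. The counts are the ones quoted in Robert's
mail, nothing more:

    # Counts quoted above, read off the pivot table.
    legacy_fixed = 14.0   # legacy criticals fixed by maintenance+support
    legacy_total = 29.0   # criticals listed as 'legacy'
    maint_output = 22.0   # everything maintenance+support fixed
    maint_added = 6.0     # criticals that maintenance work itself added

    print(legacy_fixed / legacy_total)  # ~0.483 -> "48% of the tech-debt"
    print(legacy_fixed / maint_output)  # ~0.636 -> "63% of their output"

    # Keeping the legacy pile flat means fixing all 29 each cycle,
    # i.e. roughly the 100% increase in fix rate Robert mentions.
    shortfall = legacy_total - legacy_fixed  # 15 more fixes needed
    print(shortfall - maint_added)           # 9 -> "short 9 legacy bugfixes"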
> What troubles me a bit is the unknown size of the legacy mine, and
> that from the analysis we added 25% of the legacy volume criticals
> from feature work. The great news is that all the ones you examined
> were fixed. I'd like us to make sure, though, that we don't end up
> adding performance debt - which can be particularly hard to fix.
>
> The numbers don't really say we're safe from this - 26% of criticals
> coming from changes (feature + maintenance) is a large amount, and
> features in particular are not just tweaking things, they are making
> large changes, which adds up to a lot of risk.

Actually, you should probably add the thunderdome category to this
(6%), since that was a kind of mini-feature sprint in itself. That
means 33% of the new criticals are introduced as part of major new
work.

> There are two aspects to the feature rotation that have been
> worrying me for a while; one is performance testing of new work
> (browser performance, data scaling - the works), the other is that
> we rotate off right after users get the feature. I think we should
> allow 10% of the feature time, or something like that, so that
> after-release-adoption issues can be fixed from the resources
> allocated to the feature. One way to do this would be to say that:
>
> - After release, feature squads spend 1-2 weeks doing polish and/or
>   general bugs (in the area, or even just criticals->high etc). At
>   the end of that time, they stay on the feature, doing this same
>   stuff, until all the critical bugs introduced/uncovered by the
>   feature work are fixed.

If I understand this correctly, you are saying that the maintenance
squad shouldn't start a new feature until the feature squad whose
place they are ready to take has fixed all Criticals related to the
feature (with a minimum of 2 weeks to uncover issues)? I think it's
probably worth a try. It would be a relatively low-impact way of
tweaking the feature vs maintenance ratio.

> For the performance side, we could make performance/scalability
> testing a release criterion: we already agree that all pages done
> during a feature must have a <1 sec 99th percentile and a 5-second
> timeout. Extending this to say that we've tested those pages with
> large datasets would be a modest tweak and likely catch issues.

That's something that Matthew and Diogo can add to the release
checklist (I've sketched below what such a check could look like). Are
we enforcing the 5-second timeout in any way at this stage?

> I think it's ok that criticals found a few weeks later be handled by
> the maintenance squads, which will include the erstwhile feature
> squad that triggered them, but we should account for the majority of
> the feature-related criticals in the resourcing of the feature -
> scaling issues in particular can be curly and require weeks of work,
> something maintenance mode, with its interrupts etc, is not suited
> to. And our velocity measurements shouldn't be higher by not
> counting that work as part of the feature :)

Agreed, and your 2-weeks+ wind-down period addresses that.
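Here is the sketch I promised of what a "tested with large datasets"
checklist item could look like. It's only a shape, of course:
page_render and make_sample_data are hypothetical stand-ins for
whatever harness Matthew and Diogo would actually wire this into.

    import time

    HARD_TIMEOUT = 5.0  # the agreed 5-second timeout
    P99_BUDGET = 1.0    # <1 sec at the 99th percentile

    def percentile(samples, p):
        ordered = sorted(samples)
        return ordered[min(len(ordered) - 1, int(len(ordered) * p))]

    def check_page(page_render, make_sample_data, rows=10000, runs=100):
        # Render against a large dataset, not the tiny dev default.
        make_sample_data(rows)
        timings = []
        for _ in range(runs):
            start = time.time()
            page_render()
            elapsed = time.time() - start
            # Any single render over the hard timeout fails outright.
            assert elapsed < HARD_TIMEOUT, "timeout blown at %d rows" % rows
            timings.append(elapsed)
        # The release criterion: 99th percentile under one second.
        assert percentile(timings, 0.99) < P99_BUDGET

Even something this crude, run before we rotate off, would catch the
obvious scaling regressions while the feature squad is still around to
fix them.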
Cheers,

-- 
Francis J. Lacoste
francis.laco...@canonical.com