Hi!

Thanks for the feedback; this is useful information.

On Tue, Sep 15, 2020 at 3:00 AM Niklas Laxström
<niklas.laxst...@gmail.com> wrote:
> On Mon, Sep 14, 2020 at 11:49 PM Tyler Cipriani (tcipri...@wikimedia.org)
> wrote:
> If there is an increase in the amount of real new issues and/or
> decrease in the amount of issues fixed, then I would be worried. Given
> what I said above, it's difficult to see if this is the case.

Indeed, a trendline for production quality is difficult to read when
a large backlog is being added at the same time.

> Regardless, I do agree that we should aim to minimize production
> errors to make it easier to spot any new issues. I would encourage all
> maintainers and development teams to ensure that they have a regular
> process to check if they have and triage any production issues in code
> they maintain.

+100 to checking for production errors. It's my hope that folks who
have code that is going out on a train are:

1. Aware their code is going to production that week
2. Watching for related logs and alerts (where possible)
3. Performing other software quality assurance activities on their
code as it rolls out (manual testing, for example)

My assessment of risk as a person deploying software to production is
necessarily linked to my view into quality assurance activities. If
production errors are growing, I worry about sustainability. The
production error dashboard's past stability has provided assurances
about shared awareness and priority of a given week's deployment.

That is, I know there are software quality activities that take place
sometime after code hits group0, group1, or group2; however, much of
that activity remains opaque to me, which is why this dashboard is
crucial for deployment.

Having the explicit assurances of folks whose code is going to
production that week would be preferable to any inference I can make
from this dashboard. It's my hope that maintainers and teams triaging
and grooming this dashboard will create an emergent process that can
be used to provide real insight. That is, if we all keep this
dashboard up to date, it will be easier to see when quality assurance
activities have taken place. Further, if we collectively fret over
this dashboard, we'll share an awareness of anomalies.

> Ending with a question: do we want to have both frontend and backend
> errors on the same tag/board, or should they be on separate ones?

That's a good question. I think that having a single workboard is nice
as there are reporting features[0] that provide some insights about
the overall health of production. Those insights are, as evidenced,
only as good as their inputs, but they remain valuable to me.
Additionally, a single tag may be used in saved searches and custom
dashboards to make it easy to stay on top of issues seen in production
(that's my hope, at least, though it may not align with how folks
triage in practice).
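
For example (just a rough sketch, not an endorsement of any particular
workflow -- the tag slug and token below are placeholders, and I'm
assuming Conduit's maniphest.search endpoint), a single tag makes it
easy to script against the workboard:

    #!/usr/bin/env python3
    # Rough sketch: list open tasks carrying a single production-error
    # tag via Phabricator's Conduit API (maniphest.search).
    # The tag slug and API token below are placeholders/assumptions.
    import requests

    PHAB_API = "https://phabricator.wikimedia.org/api/maniphest.search"
    API_TOKEN = "api-XXXXXXXXXXXXXXXXXXXXXXXXXXXX"  # your Conduit token
    TAG = "wikimedia-production-error"  # assumed tag slug; adjust as needed

    def open_production_errors():
        """Return open tasks tagged with the production-error project."""
        resp = requests.post(PHAB_API, data={
            "api.token": API_TOKEN,
            "constraints[projects][0]": TAG,
            "constraints[statuses][0]": "open",
        })
        resp.raise_for_status()
        return resp.json()["result"]["data"]

    if __name__ == "__main__":
        for task in open_production_errors():
            print("T{}: {}".format(task["id"], task["fields"]["name"]))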

Thanks again for the feedback. This anomaly makes more sense to me now
than it did before :)

-- Tyler

[0]: <https://phabricator.wikimedia.org/project/reports/1055/>

_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
