Re: [Wikitech-l] Gerrit Commit Wars

Ryan Lane Fri, 07 Mar 2014 16:05:52 -0800

On Fri, Mar 7, 2014 at 2:54 PM, Tyler Romeo <tylerro...@gmail.com> wrote:


> On Fri, Mar 7, 2014 at 5:39 PM, George Herbert <george.herb...@gmail.com>
> wrote:
>
> > With all due respect; hell, yes, development comes in second to
> > operational stability.
> >
> > This is not disrespecting development, which is extremely important by
> > any measure.  But we're running a top-10 worldwide website, a key
> > worldwide information resource for humanity as a whole.  We cannot
> > cripple development to try and maximize stability, but stability has to
> > be priority 1.  Any large website's teams will have the same attitude.
> >
> > I've had operational outages reach the top of everyone's news
> > source/feed/newspaper/broadcast.  This is an exceptionally unpleasant
> > experience.
> >
>
> If you really think stability is top priority, then you cannot possibly
> think that the current deployment process is sane.
>
>
Developers shouldn't be blocked on deployment or operations. Development is
expensive and things will break either way. It's good to assume things will
break and:

1. Have a simple way to revert
2. Put tests in for common errors
3. Have post-mortems where information is kept for historical purposes and
bugs are created to track action items that come from them


> Right now you are placing the responsibility on the developers to make sure
> the site is stable, because any change they merge might break production
> since it is automatically sent out. If anything that gives the appearance
> that the operations team doesn't care about stability, and would rather
> wait until things break and revert them.
>
>
Yes! This is a _good_ thing. Developers should feel responsible for what
they build. It shouldn't be operations' job to make sure the site is
stable for code changes. Things should go more in this direction, in fact.

I'm not totally sure what you mean by "it's automatically sent out",
though. Deploys are manual.


> It is the responsibility of the operations team to ensure stability. Having
> to revert something because that's the only way production will be stable
> is not a proper workflow.
>
>
It's the responsibility of the operations team to ensure stability at the
infrastructure level, not at the application level. It's sane to expect to
revert things because things will break no matter what. Mean time to
recovery is at least as important as mean time between failures.
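
To put rough numbers on that, here's a minimal sketch using the standard
steady-state availability formula, availability = MTBF / (MTBF + MTTR). The
function name and the hour figures are made up for illustration; they are
not measurements from our infrastructure.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the service is up, given mean time between
    failures (MTBF) and mean time to recovery (MTTR), both in hours."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical figures, chosen only to compare the two strategies.
baseline = availability(mtbf_hours=100.0, mttr_hours=1.0)
fewer_failures = availability(mtbf_hours=200.0, mttr_hours=1.0)   # break half as often
faster_recovery = availability(mtbf_hours=100.0, mttr_hours=0.5)  # revert twice as fast

print(f"baseline:        {baseline:.4%}")
print(f"fewer failures:  {fewer_failures:.4%}")
print(f"faster recovery: {faster_recovery:.4%}")
```

Under this simple model, halving recovery time buys the same availability
as doubling the time between failures, which is why a simple, fast revert
path matters so much.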

- Ryan
_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
