Thanks! I know you're doing your best to deal with outages and
performance issues.
Out of curiosity, do you foresee the Foundation allocating some more
dedicated people/hardware for Labs?
On 22/02/2015 01:57, Andrew Bogott wrote:
On 2/20/15 8:07 AM, Ricordisamoa wrote:
Thank you.
I (and probably many others) would like someone from the Ops team to
elaborate on the uptime and general reliability Labs (especially
Tools) is supposed to provide, and on what kinds of services it is
suitable for, to prevent future misunderstandings regarding the loss
of important work, etc.
Hello!
I don't want to ignore your question, but I also don't exactly know
how to answer it. We're very unlikely to be able to project any kind
of future uptime percentage, because currently labs runs on few enough
servers that any attempt to predict uptime by multiplying failure
rates by server counts would produce such giant error bars as to be
useless.
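To make the error-bar point concrete (with made-up numbers, not actual
Labs figures): with only a handful of servers, the statistical
uncertainty on any observed failure rate is far larger than the rate
itself. A rough Python sketch of the idea:

import math

def wilson_interval(failures, trials, z=1.96):
    """95% Wilson score interval for an observed failure proportion."""
    p = failures / trials
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    margin = (z / denom) * math.sqrt(
        p * (1 - p) / trials + z**2 / (4 * trials**2))
    return max(0.0, centre - margin), min(1.0, centre + margin)

# Illustrative only: suppose 1 of 10 servers failed in the past year.
low, high = wilson_interval(failures=1, trials=10)
print(f"Observed annual failure rate: 10%, "
      f"95% interval roughly {low:.0%} to {high:.0%}")
# -> roughly 2% to 40%; the error bars swamp the estimate.

With hundreds of servers that interval would tighten enough to be
useful; at the current scale, a projected uptime percentage would be
mostly noise.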
Nonetheless, I can recap our uptime and storage vulnerabilities so
that you know what to be wary of.
Bad news:
- Each labs instance is stored on a single server. If any one server
is destroyed in a catastrophe (a hard-drive crash, a blow from a
pickaxe, etc.), the state of all contained VMs will be suspended or, in
extreme cases, lost. [1]
- There are three full-time Operations staff members dedicated to
supporting labs. We don't cover all timezones perfectly, and
sometimes we take weekends and vacations. [2]
- Although the Tools grid engine is distributed among many instances
(and, consequently, many physical servers), actual tools usage relies
on several single points of failure, the most obvious of which is the
web proxy (see the sketch after this list). [3]
- All of labs currently lives in a single datacenter. It's a very
dependable datacenter, but nonetheless vulnerable to cable cuts,
fires, and other local disaster scenarios. [4]
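To make the single-point-of-failure arithmetic concrete (the component
names and availabilities below are assumed, purely for illustration):
when every request must pass through several components in series, the
overall availability is the product of the individual ones, so even a
few quite-reliable single points of failure noticeably drag the total
down.

# Assumed, made-up availabilities -- not measured Labs figures.
components = {
    "web proxy": 0.995,
    "grid master": 0.995,
    "NFS server": 0.99,
}
overall = 1.0
for name, availability in components.items():
    overall *= availability
print(f"Combined availability: {overall:.3f}")
# -> about 0.980, i.e. roughly 2% expected downtime even though each
#    component is individually at 99%+.

This is why the web proxy (and anything else in the critical path for
every request) matters far more than any one of the many instances the
grid is spread across.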
Good news:
- Problems like the Ghost vulnerability, which mandated a reboot of all
hardware in late January, are very rare.
- The cause of the outage on Tuesday was quite bad (and quite
unusual), but we were nevertheless able to recover from it without
data loss.
https://wikitech.wikimedia.org/wiki/Incident_documentation/20150217-LabsOutage
- Yuvi has churned out a ton of great monitoring tools, which mean that
we're ever more aware of and responsive to incidents that might
precede outages.
- Use of Labs and Tools is growing like crazy! This means that the
Labs team is stretched a bit thin rushing to keep up, but I have a
hard time thinking of this as bad news.
I'm aware that this response is entirely qualitative, and that you
might prefer some actual quantities and statistics. I'm not reluctant
to provide those, but I simply don't know where to begin. If you have
any specific questions that would help address your particular
concerns, please don't hesitate to ask.
-Andrew
[1] This is consistent with a 'cattle, not pets' design pattern. For
example, all tools instances are fully puppetized and any lost
instance can be replaced with a few minutes' work. Labs users outside
of the Tools project should hew closely to this design model as well.
This vulnerability could be partially mitigated with something like
https://phabricator.wikimedia.org/T90364, but that has potential
downsides.
Note that data stored on shared NFS servers and in databases is highly
redundant and much less subject to destruction.
[2] Potential mitigation for this is obvious, but extremely expensive :(
[3] Theoretical mitigation for this is
https://phabricator.wikimedia.org/T89995, for which I would welcome a
Hackathon collaborator.
[4] I believe that there are plans in place for backup replication of
NFS and Database data to a second data center; I will let Coren and
Sean comment on the specifics.
_______________________________________________
Labs-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/labs-l