On 2/21/15 5:34 PM, Ricordisamoa wrote:
Thanks! I know you're doing your best to deal with outages and
performance issues.
Out of curiosity, do you foresee the Foundation allocating some more
dedicated people/hardware for Labs?
We have only just added a third full-time engineer, Yuvi. My preference
going forward is to distribute labs knowledge more widely through the
Ops team so that there are /many/ more people available to help in a pinch.
We've been documenting and scripting as much as we can to facilitate
that... if everything is still falling to just the three of us a few
months from now then we can start lobbying for a fourth dedicated engineer.
Labs isn't especially constrained by hardware limitations; it's much
more a question of human bandwidth to adequately manage what hardware we
have. The foundation has been quick to fund Labs hardware requests when
we make them -- the pain is generally in transition and management
rather than in limited financial resources. Case in point: a shiny
new pile of hard drives is the /cause/ of the outage in the subject line :)
-Andrew
On 2/22/15 1:57 AM, Andrew Bogott wrote:
On 2/20/15 8:07 AM, Ricordisamoa wrote:
Thank you.
I (and probably many others) would like someone from the Ops team to
elaborate on the uptime and general reliability Labs (especially
Tools) is supposed to have, and on what kinds of services it is
suitable for, to prevent future misunderstandings regarding loss of
important work, etc.
Hello!
I don't want to ignore your question, but I also don't exactly know
how to answer it. We're very unlikely to be able to project any kind
of future uptime percentage, because currently labs runs on few
enough servers that any attempt to predict uptime by multiplying
failure rates by server counts would produce such giant error bars as
to be useless.
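To make that concrete, here is a toy sketch (my own illustration, with
made-up numbers, not anything from our actual monitoring) of how wide
an exact Poisson confidence interval gets when you have only observed
a couple of failures:

    from scipy.stats import chi2

    # Hypothetical numbers, purely for illustration.
    failures = 2      # failures observed
    hours = 90 * 24   # over roughly 90 days

    # Exact (Garwood) 95% confidence interval on a Poisson count.
    low = chi2.ppf(0.025, 2 * failures) / 2
    high = chi2.ppf(0.975, 2 * (failures + 1)) / 2

    print("point estimate: %.5f failures/hour" % (failures / hours))
    print("95%% interval: %.5f to %.5f failures/hour"
          % (low / hours, high / hours))

The interval on the count runs from roughly 0.24 to 7.22 failures, a
~30x spread, so any uptime percentage projected from it would be more
error bar than estimate.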
Nonetheless, I can recap our uptime and storage vulnerabilities so
that you know what to be wary of.
Bad news:
- Each labs instance is stored on a single server. If any one server
is destroyed in a catastrophe (e.g. hard-drive crash, blow from a
pickaxe, etc.) the state of all contained VMs will be suspended or,
in extreme cases, lost. [1]
- There are three full-time Operations staff-members dedicated to
supporting labs. We don't cover all timezones perfectly, and
sometimes we take weekends and vacations. [2]
- Although the Tools grid engine is distributed among many instances
(and, consequently, many physical servers), actual tools usage relies
on several single points of failure, the most obvious of which is the
web proxy. [3]
- All of labs currently lives in a single datacenter. It's a very
dependable datacenter, but nonetheless vulnerable to cable cuts,
fires, and other local disaster scenarios. [4]
Good news:
- Problems like the Ghost vulnerability, which mandated a reboot of
all hardware in late January, are very rare.
- The cause of the outage on Tuesday was quite bad (and quite
unusual), but we were nevertheless able to recover from it without
data loss.
https://wikitech.wikimedia.org/wiki/Incident_documentation/20150217-LabsOutage
- Yuvi has churned out a ton of great monitoring tools which mean
that we're ever more aware of and responsive to incidents that might
precede outages.
- Use of Labs and Tools is growing like crazy! This means that the
Labs team is stretched a bit thin rushing to keep up, but I have a
hard time thinking of this as bad news.
I'm aware that this response is entirely qualitative, and that you
might prefer some actual quantities and statistics. I'm not
reluctant to provide those, but I simply don't know where to begin.
If you have any specific questions that would help address your
particular concerns, please don't hesitate to ask.
-Andrew
[1] This is consistent with a 'cattle, not pets' design pattern. For
example, all tools instances are fully puppetized and any lost
instance can be replaced with a few minutes' work. Labs users
outside of the Tools project should hew closely to this design model
as well. This vulnerability could be partially mitigated with
something like https://phabricator.wikimedia.org/T90364 but that has
potential downsides.
Note that data stored on shared NFS servers and in databases is
highly redundant and much less subject to destruction.
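For a rough sense of what "a few minutes' work" looks like, here is an
illustrative sketch (hypothetical names, image, and flavor; assuming an
OpenStack-style API via python-novaclient) of booting a replacement
instance:

    from novaclient import client

    # Hypothetical credentials and endpoint, purely for illustration.
    nova = client.Client("2", "user", "secret", "someproject",
                         "https://example.org:5000/v2.0")

    # Pick a base image and size for the replacement.
    image = nova.images.find(name="ubuntu-trusty")
    flavor = nova.flavors.find(name="m1.medium")

    # Boot a fresh instance; because everything is puppetized, the new
    # VM configures itself on first boot and no per-instance state is
    # needed.
    nova.servers.create(name="tools-exec-99", image=image, flavor=flavor)

Everything that matters lives in puppet and on shared storage, so the
instance itself is disposable.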
[2] Potential mitigation for this is obvious, but extremely expensive :(
[3] Theoretical mitigation for this is
https://phabricator.wikimedia.org/T89995, for which I would welcome a
Hackathon collaborator.
[4] I believe that there are plans in place for backup replication of
NFS and Database data to a second data center; I will let Coren and
Sean comment on the specifics.
_______________________________________________
Labs-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/labs-l