Hi.

We have had some longer discussions on various mailing lists and IRC about
how a vm-admin should behave and what level of service we expect for our
servers.

We need new admins, so this is also a request for anyone interested to chip
in.

We have had some unfortunate incidents on all 3 vms, of different natures,
which made me question whether we as a community:
a) want servers that are cared for professionally, or only by happenstance.
b) want to (and are capable of) maintaining the servers ourselves.
c) are prepared to support a change that makes a) and b) possible.

I have formulated some thoughts on how admins could work, but in general I
believe we should convince infra to take over the vm responsibility and
keep our well-functioning forum/wiki admins.

We have a vm-team in place that was created with the purpose of not having
a single person as admin. In my opinion the team has not lived up to that
purpose, but I am still thankful for the help I have received.

Remark: the ideas below are my personal thoughts, which I have applied
during the time I maintained the servers:

===========
The servers should at all times be maintained with the following priority:
1) security (the downside of being popular is attracting the attention of
people who want to gain merit by breaking our servers)
2) stability (we have limited cpu/ram/disk, so we must optimize)
3) add user wishes (we already have stable systems; 1) and 2) are far more
important than enhancing the systems).

Being an admin on a vm is a job that does not take so much time, but it
requires a lot of monitoring and communication (especially with infra).

A good setup would have 3 types of admin:
- Each server has an appointed "owner" (anchor-admin).
- A number of persons have full sudo on a server (admin).
- A number of persons can reboot/restart/work on po files (help-admin).
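
To make the three levels concrete, here is a minimal sketch of how the
split could be written down. The action names are hypothetical and only
for illustration; a real setup would express this in the sudo
configuration on each vm.

```python
from enum import Enum


class Role(Enum):
    ANCHOR_ADMIN = "anchor-admin"
    ADMIN = "admin"
    HELP_ADMIN = "help-admin"


# Hypothetical action names, chosen for illustration only.
HELP_ADMIN_ACTIONS = frozenset({"reboot", "restart_service", "edit_po_files"})


def is_allowed(role: Role, action: str) -> bool:
    """Anchor-admins and admins have full sudo; help-admins a short list."""
    if role in (Role.ANCHOR_ADMIN, Role.ADMIN):
        return True
    return action in HELP_ADMIN_ACTIONS
```

Having full sudo does not remove the need to agree changes with the
anchor-admin; the sketch only covers what each role is technically able
to do.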

=== Anchor-admin responsibilities ===
Anchor-admin is the "owner" of the vm and the prime contact to infra.

Anchor-admin has the overall responsibility of the vm.
1) help when receiving alerts
2) keep informed on available patches, especially security-related patches
3) create/keep a maintenance plan
4) coordinate changes external to vm (like dns) with infra
5) participate in infra discussions relevant for the vm (e.g. certificates)
6) monitor the vm regularly for resource usage
7) ensure that application changes are implemented with relevant consensus
8) discuss work with the admins, with the goal that they should be able to
take over one day.

These activities are expected to take 3-4 hours per week, more in the
beginning and less later. The hours needed depend heavily on the number and
skill level of the admins.

=== Admin responsibilities ===
Admins help the anchor-admin with ongoing maintenance and have full sudo.

All changes must be discussed and agreed with the anchor-admin; no change
is so important that it cannot wait until it has been discussed!

Admins are expected to:
1) help when receiving alerts
2) stay informed about the vm configuration,
including but not limited to:
- which configuration is done where, and where it is stored (svn/backup)
- how the applications are configured
- read the runbook, and update it if something is unclear
3) participate in the regular maintenance
4) coordinate all non-scheduled work with anchor-admin

These activities are expected to take 1-2 hours per week, more in the
beginning and less later.

Admins do not need to be specialists; we all learn, but it is important
that an admin has the motivation and time to learn.


=== Help-admin responsibilities ===
Help-admins are located in different timezones, so that we have 24/7
coverage, and they have limited sudo (only restart/reboot/handle po files).

When a help-admin receives an alert mail, the following actions should be
taken:
1) check whether the vm is reachable via ssh; if so log in, otherwise
escalate to admin/infra
2) check whether the vm is overloaded, or whether apache/mysql is not running
3) restart the needed processes
4) mail at least the anchor-admin with observations and what was done.
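
The steps above can be sketched as a small decision function. This is
only an illustration of the flow; the names VmStatus and triage are
made up for the example, and sending the mail also in the escalation
case is my reading of step 4.

```python
from dataclasses import dataclass, field
from typing import List

MAIL_STEP = "mail anchor-admin with observations and what was done"


@dataclass
class VmStatus:
    """What the help-admin observed (hypothetical structure)."""
    ssh_reachable: bool
    overloaded: bool = False
    stopped_services: List[str] = field(default_factory=list)  # e.g. ["apache"]


def triage(status: VmStatus) -> List[str]:
    """Return the ordered list of actions for one alert."""
    # Step 1: without ssh access a help-admin can only escalate.
    if not status.ssh_reachable:
        return ["escalate to admin/infra", MAIL_STEP]
    actions = ["login via ssh"]
    # Steps 2 and 3: restart whatever is not running.
    for service in status.stopped_services:
        actions.append(f"restart {service}")
    if status.overloaded and not status.stopped_services:
        actions.append("restart the needed processes")
    # Step 4: always report back.
    actions.append(MAIL_STEP)
    return actions
```

For example, triage(VmStatus(ssh_reachable=True, stopped_services=["mysql"]))
gives a login, a restart of mysql, and the mail to the anchor-admin.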


===
Remark: the above are just my thoughts; there are a lot of other
possibilities.

Let's hear your opinion.

rgds
jan I.
