On 5 September 2013 22:14, Kay Schenk <kay.sch...@gmail.com> wrote:

> On Wed, Sep 4, 2013 at 9:36 AM, janI <j...@apache.org> wrote:
>
> > Hi.
> >
> > We have had some longer discussions on different ML/IRC about how a
> > vm-admin should behave and which level of service we expect for our
> > servers.
> >
> > We need new admins, so this is also a request for anyone interested
> > to chip in.
> >
> > We have had some unfortunate incidents on all 3 vms, of different
> > nature, which made me question if we as a community:
> > a) want servers that are cared for professionally or ad hoc.
> > b) want to (are capable to) maintain the servers ourselves.
> > c) are prepared to support a change that makes a), b) possible.
> >
> > I have formulated some thoughts on how admins could work, but in
> > general I believe we should convince infra to take over the vm
> > responsibility and keep our well functioning forum/wiki admins.
> >
> > We have a vm-team in place, which was created with the purpose of not
> > having a single person as admin. In my opinion the team has not lived
> > up to that purpose, but I am still thankful for the help I have
> > received.
> >
> > Remark: the ideas below are my personal thoughts, which I have used
> > during the time I maintained the servers:
> >
> > ===========
> > The server should at all times be maintained with the following
> > priority:
> > 1) security (the downside of being popular is to have the attention
> >    of people who want to gain merit by breaking our servers)
> > 2) stability (we have limited cpu/ram/disk so we must optimize)
> > 3) add user wishes (we already have stable systems, 1,2 are far more
> >    important than enhancing the systems).
> >
> > Being an admin on a vm is a job that does not take so much time, but
> > requires a lot of monitoring and communication (especially with infra).
> >
> > A good setup would be 3 types of admin:
> > Each server will have an appointed "owner" (anchor-admin)
> > A number of persons have full sudo on a server (admin)
> > A number of persons can reboot/restart/work on po files (help-admin)
> >
> > === Anchor-admin responsibilities ===
> > Anchor-admin is the "owner" of the vm and the prime contact to infra.
> >
> > Anchor-admin has the overall responsibility of the vm.
> > 1) help when receiving alerts
> > 2) keep informed on available patches, especially security related
> >    patches
> > 3) create/keep a maintenance plan
> > 4) coordinate changes external to vm (like dns) with infra
> > 5) participate in infra discussions relevant for the vm (e.g.
> >    certificates)
> > 6) monitor the vm regularly for resource usage
> > 7) ensure that application changes are implemented with relevant
> >    consensus
> > 8) discuss work with admins, with the goal that they should be able
> >    to take over one day.
> >
> > These activities are expected to take 3-4 hours per week, more in the
> > beginning and less later. The hour usage highly depends on the number
> > and level of admins.
> >
> > === Admin responsibilities ===
> > Admins help the anchor-admin with ongoing maintenance and have full
> > sudo.
> >
> > All changes must be discussed and agreed with the anchor-admin, no
> > change is so important that it cannot wait until discussed!
> >
> > Admins are expected to:
> > 1) help when receiving alerts
> > 2) stay informed about the vm configuration
> >    including but not limited to:
> >    - where each configuration is done and stored (svn/backup)
> >    - how the apps are configured
> >    - read and update the runbook, if something is unclear
> > 3) participate in the regular maintenance
> > 4) coordinate all non-scheduled work with anchor-admin
> >
> > These activities are expected to take 1-2 hours per week, more in the
> > beginning and less later.
> >
> > Admins do not need to be specialists, we all learn, but it is
> > important that the admins have motivation and time to learn.
> >
> >
> > === Help-admin responsibilities ===
> > Help-admins are located in different timezones, so we have 24/7
> > coverage, and have limited sudo (only restart/reboot/handle po files).
> >
> > When a help-admin receives an alert mail, these actions should be
> > taken:
> > 1) if the vm is reachable via ssh, log in; otherwise escalate to
> >    admin/infra
> > 2) check whether the vm is overloaded, or apache/mysql is not running
> > 3) restart the needed processes
> > 4) mail at least the anchor-admin with observations and what was done.
> >
> >
> > ===
> > Remark: the above are just my thoughts, there are a lot of other
> > possibilities.
> >
> > Let's hear your opinion.
> >
> > rgds
> > jan I.
> >
>
> I would like to discuss this topic further, much further as a matter of
> fact, but right now I don't really have enough information.
>
> Can you provide details on the following (or point to a document that
> describes this):
>
> * to aid our memories, who are the current vm-team
>
jürgen, andrea, imacat, arist and myself.
> * what are the three servers now under the vm-team
>
ooo-wiki-vm2.a.o (wiki.openoffice.org), ooo-forums-vm.a.o
(forums.openoffice.org), translate-vm2.a.o

Our servers also depend on erebus.a.o, which is a proxy server for HTTPS.

> * what vm-OS does each use
>
ubuntu 12.04 (I have standardized that part).

> * for each server, what are the specific applications a vm-sysadmin would
> need to know/become familiar with to be an effective sysadmin
>
for all 3 systems:
- ubuntu, especially apt-get, apparmor
- httpd, local installation as defined in ASF
- php, generic installation
- puppet, config as defined in ASF
- sshd, config as defined in ASF
- svn, usage depends on the individual server, but in general all static
  changes are defined here
- apbackup, as used by ASF
- memcached
- mysql
- /root/bin, helper scripts
- security applications, as defined in ASF (details are on purpose not
  given to a public list).

For ooo-wiki:
- MediaWiki
- ATS

For ooo-forums:
- phpBB (remark: multiforum setup with links)

For translate:
- pootle
- django

> * how are alerts on system failure currently handled
>
Nagios and Circonus standard setup. Detected alerts go to #asfinfra,
infra-team and vm-team.

> * what resources would a vm-admin need to respond to a system failure
>
??? I am not sure I understand what you mean.
help-admin would restart/reboot the system
admin would locate the problem and try to fix it

>
> Your role outline is good, but I think before we discuss future strategy,
> we need a better idea about what's involved.
>
Or maybe we need someone who is interested before discussing
theoretically. The items I listed above should all be obvious to people
with SA experience.

We can dream about strategies, but if we don't have volunteers, or the
people who volunteered don't do the job, it seems to be the wrong way. The
vm-team was defined for that purpose, but I don't think the vm-team, apart
from me, has responded to a single alert.
Arist helped me a lot with changing mysql on the forum, but that is about
all the help I have received.

Our current problem is much more that the current admins do what they like
to do, instead of following an organized plan. We do not need more people,
we need people who care and do something as a team.

rgds
jan I.

> --
> -------------------------------------------------------------------------------------------------
> MzK
>
> "Truth is stranger than fiction, but it is because Fiction is obliged
> to stick to possibilities. Truth isn't."
>                  -- "Following the Equator", Mark Twain