I have a decent amount of experience in the space, originally in Cisco and Google DCs and more recently in the cloud. Here are a few opinionated suggestions, in the order I'd tackle them:
1) First, I'd address configuration management across the fleet. Get every node enrolled in configuration management of some sort. In my case I always defaulted to Puppet <https://www.puppet.com/> (Ruby DSL), but I've heard others have great experiences with Ansible <https://www.ansible.com/> (Python). Set up a local development environment that mirrors the servers using VirtualBox, Vagrant, Docker, etc.; there are a lot of options in the space. The main goal is to be able to develop manifests locally that you can confidently push out to the fleet. I'd also highly recommend a git repository (locally hosted if security is the highest concern, a GitHub private repo otherwise) so that you always have a rollback strategy for bad manifests that make it through your local development and QA pipeline. It will happen, and it will be fixable. (There's a rough inventory-bootstrapping sketch after this list.)

2) Next I'd tackle observability. Deploy Prometheus node exporter <https://prometheus.io/docs/guides/node-exporter/> to every node using the configuration management we set up in step one. Each server will then publish a lightweight metrics page on a port of your choosing, and you configure Prometheus <https://prometheus.io/> to scrape those endpoints across the desired IP ranges. Bonus points if you can use the same configuration management system (Puppet, Ansible, etc.) to configure your Prometheus server and store those manifests in git. If you're a hacker-man sort, set up Prometheus Alertmanager <https://prometheus.io/docs/alerting/latest/alertmanager/>. If you prefer a little more fit and finish, deploy Grafana <https://grafana.com/> and integrate with PagerDuty <https://www.pagerduty.com/>. If logs are of value you can configure Loki <https://grafana.com/oss/loki/> to ingest them into Grafana for a single-pane view of infra ops. Depending on your budget you may be at a good scale to leverage Grafana Cloud <https://grafana.com/products/cloud/> or Datadog <https://www.datadoghq.com/> if you'd prefer to spend more and maintain less. If you choose that route, Elastic Cloud <https://www.elastic.co/cloud> and Splunk <https://www.splunk.com/en_us/products/splunk-cloud-platform.html> are both good options for cloud-hosted log aggregation. (See the exporter-sweep and down-target sketches after this list.)

3) Now that we have some visibility into the infrastructure, it's time to start documenting. I would bring up an instance of phpIPAM <https://phpipam.net/demo/>. There's some overlap with Prometheus in functionality, but I find having both at my disposal invaluable when tracking down issues. I'd also begin a sitebook/runbook in the wiki product of your choice. Notion <https://www.notion.so/> is a popular choice, but a Google Doc will work in a pinch. Share it with everyone. I'd also introduce an IT Asset Management (ITAM) system to handle inventory, but for better or worse all of the ones I've used have been home-grown solutions, so I don't have a solid recommendation.

4) Now that we've hit all the basics, I'd add on some nice-to-haves. Are you responsible for networking? Set up RANCID <https://github.com/haussli/rancid> to collect your configs and find a good solution for SNMP ingestion; I really like Observium <https://www.observium.org/>. Start capturing that data in the wiki of your choice. Develop a strategy and process for documenting and addressing outages. IPMI <https://en.wikipedia.org/wiki/Intelligent_Platform_Management_Interface> may be worth investigating, but with a mixed fleet I'd also explore other solutions like network-accessible PDUs. (There's a PDU-polling sketch after this list.)
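To make step 1 concrete, here's a minimal sketch of bootstrapping an Ansible inventory from whatever flat host list you already have. The CSV layout, file names, and role grouping are all assumptions on my part -- adapt to your actual records:

#!/usr/bin/env python3
# Rough sketch: turn a flat CSV of hosts (hostname,ip,role) into an
# Ansible INI inventory grouped by role. File names and the CSV layout
# are placeholders -- adapt to whatever inventory source you actually have.
import csv
from collections import defaultdict

SRC = "hosts.csv"        # hypothetical export of your current host list
DST = "inventory.ini"

groups = defaultdict(list)
with open(SRC, newline="") as fh:
    for row in csv.DictReader(fh):
        groups[row["role"].strip()].append(
            f'{row["hostname"].strip()} ansible_host={row["ip"].strip()}'
        )

with open(DST, "w") as fh:
    for role in sorted(groups):
        fh.write(f"[{role}]\n")
        fh.write("\n".join(sorted(groups[role])) + "\n\n")

print(f"wrote {sum(len(v) for v in groups.values())} hosts to {DST}")

A first smoke test with something like "ansible all -i inventory.ini -m ping" tells you which of the 800 you can actually reach before you start pushing manifests.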
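For step 2, once the exporter is pushed out, a quick sweep like the one below will tell you which hosts still aren't exporting. The subnet is a placeholder and 9100 is just node_exporter's default port -- change both to match your ranges:

#!/usr/bin/env python3
# Rough sketch: sweep a subnet and report which hosts are serving
# node_exporter metrics. Subnet and port are assumptions.
import ipaddress
import urllib.request
from concurrent.futures import ThreadPoolExecutor

SUBNET = "10.0.0.0/24"   # placeholder range
PORT = 9100              # node_exporter default

def check(ip):
    url = f"http://{ip}:{PORT}/metrics"
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return str(ip), resp.status == 200
    except OSError:
        return str(ip), False

hosts = list(ipaddress.ip_network(SUBNET).hosts())
with ThreadPoolExecutor(max_workers=64) as pool:
    results = list(pool.map(check, hosts))

missing = [ip for ip, ok in results if not ok]
print(f"{len(hosts) - len(missing)}/{len(hosts)} hosts exporting metrics")
for ip in missing:
    print("no exporter:", ip)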
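And once Prometheus is scraping, the same kind of check can come from Prometheus itself via its HTTP query API. The server URL below is a placeholder; "up == 0" is a standard PromQL expression for targets that failed their last scrape:

#!/usr/bin/env python3
# Rough sketch: ask the Prometheus HTTP API which scrape targets are
# currently down (up == 0). The server URL is a placeholder.
import json
import urllib.parse
import urllib.request

PROM = "http://prometheus.example.internal:9090"   # placeholder

query = urllib.parse.urlencode({"query": "up == 0"})
with urllib.request.urlopen(f"{PROM}/api/v1/query?{query}", timeout=5) as resp:
    payload = json.load(resp)

results = payload["data"]["result"]
for series in results:
    labels = series["metric"]
    print("DOWN:", labels.get("instance", "?"), labels.get("job", "?"))
if not results:
    print("all targets up")

This is also the sort of thing you can later hang off a Jenkins job or cron entry to feed the documentation/ticketing side.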
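For the step-4 PDU idea, here's a rough sketch of polling load readings over SNMP by shelling out to net-snmp's snmpget. The PDU hostnames, community string, and OID are placeholders -- the load/current OID is vendor-specific, so pull the right one from your PDU's MIB:

#!/usr/bin/env python3
# Rough sketch: poll SNMP-capable PDUs for a load reading via snmpget.
# Hostnames, community string, and OID below are placeholders.
import subprocess

PDUS = ["pdu-a1.example.internal", "pdu-a2.example.internal"]  # placeholders
COMMUNITY = "public"                                           # placeholder
LOAD_OID = "1.3.6.1.4.1.XXXX.0"   # vendor-specific, look it up in your PDU's MIB

for pdu in PDUS:
    try:
        out = subprocess.run(
            ["snmpget", "-v2c", "-c", COMMUNITY, "-Oqv", pdu, LOAD_OID],
            capture_output=True, text=True, timeout=5, check=True,
        )
        print(pdu, out.stdout.strip())
    except (subprocess.SubprocessError, FileNotFoundError) as exc:
        print(pdu, "poll failed:", exc)

If you end up on Observium or Cacti they'll do this polling and graphing for you; a script like this is mostly useful for a quick sanity check or for pushing the numbers somewhere custom.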
Some balk at the idea, but I love having a ticketing system that the infra can interact with via API to auto-document issues. Solutions range from ClickUp to Jira to Zendesk to Asana; there are many options in the space, both cloud and self-hosted. (There's a rough webhook-to-ticket sketch at the very bottom of this mail, after the quoted thread.)

After that you should have a fairly well instrumented infra that provides good feedback and a solid foundation for your documentation. Now we can move on to addressing issues that have arisen during the process, making things more automated (e.g. the Puppet/Ansible server automatically pulls updated manifests from the master branch of your git repo), or improving operator QoL in whatever way makes the most sense for you. If you have the need, bringing up a Jenkins <https://www.jenkins.io/> server to act as DC-level cron and to respond to infra stimuli for automated remediation before alerting can be a very helpful tool to have.

If I were scoping this project for a full-time gig I'd set it at about three weeks, with the personal expectation that if things went well it would take about two. Hopefully some of this will be useful. Good luck!

On Sat, Mar 2, 2024 at 2:30 PM Ben Koenig <[email protected]> wrote:

> On Saturday, March 2nd, 2024 at 10:50 AM, Ted Mittelstaedt <
> [email protected]> wrote:
>
> > Are these 800 servers virtual or physical?
>
> Physical.
>
> > Are the physical servers home-built or commercial from a major brand (HP
> Proliant, etc.)
>
> Home-built... but often with parts from major brands. Or copy cat brands
>
> > Are the servers all the same brand and model or are they a mismash of
> pieces from different makers?
>
> Uhh.. Ever seen a graphics card with a Gigabyte logo and EVGA silkscreened
> onto the PCB?
>
> > Are the servers yours or owned by customers? That is, if they are
> virtual servers owned by remote customers do you have any responsibility to
> monitor them?
>
> We own them. And the racks, cabinets, PDUs.
>
> > For "emergency notifications" the go-to for FOSS is "Big Sister"
> https://bigsister.ch/ Set that up to ping the server interface and if it
> trips a breaker and goes offline then have Big Sister email a text-to-SMS
> gateway for your cell phone number
> >
> > For monitoring power consumption you have to configure the PDUs for
> that. I've yet to see one of these that supports current monitoring but
> does not support SNMP, so once you get that going you can monitor power
> consumption with mrtg or, if you want to get fancy, https://www.cacti.net/
> Cacti is based on RRDtool with is the successor to MRTG
> https://oss.oetiker.ch/rrdtool/
>
> The PDUs have SNMP so I may have to take a look at those.
>
> I've used RT in the past and it's a bit on the excessive side. IIRC it
> uses perl and I know next to nothing about perl. As of right now, it
> basically is a one man show, I am the only one regularly on side for the
> physical hardware. That said, they want to hire a second person which is
> where these tools will start to come in handy. Creating a custom tool to
> manage all this stuff is not outside the realm of possibility, but that
> might end up meaning that I spend all my time maintaining said tool.
>
> My instinct is to start setting up some sort of relational database and
> build it up piece by piece simply because there is literally NOTHING used
> to manage this stuff. Especially since the servers are already installed
> and running. But like anything else the first step is to list all options
> and make my list of pros and cons. ;)

--
Timothy Scoppetta
P: 845-459-3002
E: [email protected]
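P.S. The webhook-to-ticket sketch I mentioned above: a tiny receiver that accepts Alertmanager's webhook POSTs and files a ticket for each firing alert. The ticketing endpoint, token, and payload shape are placeholders for whichever tracker you land on; the incoming JSON (a top-level "alerts" list) is Alertmanager's standard webhook format. You'd point a webhook_config in an Alertmanager receiver at this listener.

#!/usr/bin/env python3
# Rough sketch: accept Alertmanager webhook POSTs and open a ticket per
# firing alert. Ticketing URL, token, and outgoing payload are placeholders.
import json
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

TICKET_API = "https://tickets.example.internal/api/issues"   # placeholder
TOKEN = "changeme"                                           # placeholder

def file_ticket(alert):
    body = json.dumps({
        "title": alert["labels"].get("alertname", "unknown alert"),
        "body": json.dumps(alert["annotations"], indent=2),
    }).encode()
    req = urllib.request.Request(
        TICKET_API, data=body,
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {TOKEN}"},
    )
    urllib.request.urlopen(req, timeout=10)

class Hook(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        for alert in payload.get("alerts", []):
            if alert.get("status") == "firing":
                file_ticket(alert)
        self.send_response(200)
        self.end_headers()

HTTPServer(("0.0.0.0", 9000), Hook).serve_forever()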
