I have a decent amount of experience in the space, originally in Cisco and
Google DCs and in recent years in the cloud. Here are a few opinionated
suggestions, in the order I'd tackle them:

1) First, I'd address configuration management across the fleet. Get every
node enrolled in configuration management of some sort. In my case I always
defaulted to puppet <https://www.puppet.com/> (Ruby DSL) but I've heard
others have great experiences with Ansible <https://www.ansible.com/>
(python). Set up a local development environment that mirrors the servers
using virtualbox, vagrant, docker, etc.; there are a lot of options in the
space. The main goal is to be able to develop manifests locally that
you can confidently push out to the fleet. I'd also highly recommend a git
repository (locally hosted if security is of highest concern, in a github
private repo otherwise) so that we always have a rollback strategy for bad
manifests that make it through our local development and QA pipeline. It
will happen and it will be fixable.
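
To make that concrete, here's a minimal sketch of what the entry point
could look like if you go the Ansible route. The role name, tasks, and file
paths are placeholders I made up, not anything from your environment:

    # site.yml -- applied to the whole fleet; "baseline" is a hypothetical
    # role holding the things every node should have (users, sshd config,
    # time sync, node exporter, and so on)
    - name: Apply the baseline to every managed server
      hosts: all
      become: true
      roles:
        - baseline

    # roles/baseline/tasks/main.yml -- a couple of representative tasks
    - name: Ensure chrony is installed
      ansible.builtin.package:
        name: chrony
        state: present

    - name: Ensure chrony is running and enabled
      ansible.builtin.service:
        name: chrony
        state: started
        enabled: true

Run it from your workstation with "ansible-playbook -i inventory site.yml"
and keep all of it in the git repo so the repo stays the single source of
truth.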

2) Next I'd tackle observability. Deploy prometheus node exporter
<https://prometheus.io/docs/guides/node-exporter/> to every node using the
configuration management we set up in step one. Each server will then
publish a lightweight metrics page on a port of your choosing, and you
configure prometheus <https://prometheus.io/> to scrape those pages across
the desired IP ranges. Bonus points if
you can use the same configuration management system (puppet, ansible,
etc.) to configure your prometheus server and store those manifests in git.
If you're a hacker-man sort, set up prometheus alertmanager
<https://prometheus.io/docs/alerting/latest/alertmanager/>. If you prefer a
little more fit and finish, deploy grafana <https://grafana.com/> and
integrate with pagerduty <https://www.pagerduty.com/>. If logs are of value,
you can configure Loki <https://grafana.com/oss/loki/> to ingest them and
surface them in grafana for a single-pane view of infra ops. Depending on
your budget you may be at a good scale to leverage grafana cloud
<https://grafana.com/products/cloud/> or Datadog
<https://www.datadoghq.com/> if you'd prefer to spend more and maintain
less. If you choose this route, Elastic Cloud <https://www.elastic.co/cloud>
and Splunk
<https://www.splunk.com/en_us/products/splunk-cloud-platform.html> are both
good options for cloud-hosted log aggregation.
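
For the scrape side, the prometheus config is plain YAML. A minimal sketch,
assuming node exporter on its default port 9100 and made-up addresses; in
practice I'd have the configuration management system render the target
list (or a file_sd file) so new servers show up automatically:

    # prometheus.yml (fragment)
    global:
      scrape_interval: 30s

    scrape_configs:
      - job_name: node
        static_configs:
          - targets:            # placeholder addresses -- generate this
              - 10.0.1.10:9100  # list from your CM inventory
              - 10.0.1.11:9100
        # or point at a file your CM tool keeps up to date:
        # file_sd_configs:
        #   - files: ['/etc/prometheus/targets/*.yml']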

3) Now that we have some visibility into the infrastructure it's time to
start documenting. I would bring up an instance of phpIPAM
<https://phpipam.net/demo/>. There's some functional overlap with
prometheus, but I find having both at my disposal invaluable when
tracking down issues. I'd also begin a sitebook/runbook in the wiki product
of your choice. Notion <https://www.notion.so/> is a popular choice, but a
google doc will work in a pinch. Share it with everyone. I'd also introduce
an IT Asset Management (ITAM) system to handle inventory, but for better or
worse all of the ones I've used have been home-grown solutions, so I don't
have a solid recommendation.

4) Now that we've hit all the basics I'd add some nice-to-haves. Are you
responsible for networking? Set up rancid
<https://github.com/haussli/rancid> to collect your device configs, and
find a good solution for SNMP ingestion; I really like observium
<https://www.observium.org/>. Start capturing data in the wiki of your
choice. Develop a strategy and process for documenting and addressing
outages. IPMI
<https://en.wikipedia.org/wiki/Intelligent_Platform_Management_Interface>
may be worth investigating, but with a mixed fleet I'd also explore other
solutions like network-accessible PDUs. Some balk at the idea, but I love
having a ticketing system that the infra can interact with via API to
auto-document issues (a rough sketch of the alertmanager side of that
follows below). Solutions range from clickup to jira to zendesk to asana;
there are many options in the space, both cloud and self-hosted.
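
On the "infra files its own tickets" idea: if you end up with alertmanager
from step two, its webhook receiver is the usual glue. A rough sketch, with
the URL being a placeholder for whatever endpoint your ticketing system (or
a small shim in front of it) exposes:

    # alertmanager.yml (fragment)
    route:
      receiver: ticketing
      group_by: ['alertname', 'instance']

    receivers:
      - name: ticketing
        webhook_configs:
          - url: 'http://tickets.example.internal/api/alerts'  # placeholder
            send_resolved: true

The same pattern covers paging, since alertmanager also ships a native
pagerduty receiver.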

After that you should have a fairly well-instrumented infra that provides
good feedback and a solid foundation for your documentation. Now we can
move on to addressing issues that have arisen during the process, making
things more automated (e.g., the puppet/ansible server automatically
pulls updated manifests from the master branch of your git repo), or
improving operator QoL in whatever way makes the most sense for you. If you
have the need, bringing up a Jenkins <https://www.jenkins.io/> server to act
as DC-level cron and to respond to infra stimuli with automated remediation
before alerting can be very helpful.
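
As a sketch of the "automatically pulls updated manifests" idea on the
ansible side, ansible-pull flips the model so a host checks the repo out
and applies it on a timer; the repo URL below is a placeholder. (The puppet
equivalent is usually just having the puppet server deploy environments
from git, e.g. with r10k, on a similar schedule.)

    - name: Apply the latest manifests from git every 30 minutes
      hosts: all
      become: true
      tasks:
        - name: Install the ansible-pull cron job
          ansible.builtin.cron:
            name: ansible-pull from master
            minute: "*/30"
            job: >-
              ansible-pull -U https://git.example.internal/infra/manifests.git
              -C master -o site.yml >> /var/log/ansible-pull.log 2>&1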

If I were scoping this project for a full-time gig I'd set it at about three
weeks with the personal expectation that if things went well it would take
about two. Hopefully some of this will be useful. Good luck!

On Sat, Mar 2, 2024 at 2:30 PM Ben Koenig <[email protected]> wrote:

> On Saturday, March 2nd, 2024 at 10:50 AM, Ted Mittelstaedt <
> [email protected]> wrote:
>
> > Are these 800 servers virtual or physical?
>
> Physical.
>
> > Are the physical servers home-built or commercial from a major brand (HP
> Proliant, etc.)
>
> Home-built... but often with parts from major brands. Or copy cat brands
>
> > Are the servers all the same brand and model or are they a mishmash of
> pieces from different makers?
>
> Uhh.. Ever seen a graphics card with a Gigabyte logo and EVGA silkscreened
> onto the PCB?
>
> > Are the servers yours or owned by customers? That is, if they are
> virtual servers owned by remote customers do you have any responsibility to
> monitor them?
>
> We own them. And the racks, cabinets, PDUs.
>
> > For "emergency notifications" the go-to for FOSS is "Big Sister"
> https://bigsister.ch/ Set that up to ping the server interface and if it
> trips a breaker and goes offline then have Big Sister email a text-to-SMS
> gateway for your cell phone number
> >
> > For monitoring power consumption you have to configure the PDUs for
> that. I've yet to see one of these that supports current monitoring but
> does not support SNMP, so once you get that going you can monitor power
> consumption with mrtg or, if you want to get fancy, https://www.cacti.net/
> Cacti is based on RRDtool, which is the successor to MRTG
> https://oss.oetiker.ch/rrdtool/
> >
>
> The PDUs have SNMP so I may have to take a look at those.
>
> I've used RT in the past and it's a bit on the excessive side. IIRC it
> uses perl and I know next to nothing about perl. As of right now, it
> basically is a one-man show; I am the only one regularly on site for the
> physical hardware. That said, they want to hire a second person which is
> where these tools will start to come in handy. Creating a custom tool to
> manage all this stuff is not outside the realm of possibility, but that
> might end up meaning that I spend all my time maintaining said tool.
>
> My instinct is to start setting up some sort of relational database and
> build it up piece by piece simply because there is literally NOTHING used
> to manage this stuff. Especially since the servers are already installed
> and running. But like anything else the first step is to list all options
> and make my list of pros and cons. ;)
>


-- 
Timothy Scoppetta

P: 845-459-3002
E: [email protected]
