On Sat, 2 Mar 2024, Ben Koenig wrote:

Does anyone here who works with SMB scale datacenter environments have any
tips or industry standard strategies for wrangling this type of setup? Are
there any good FOSS software tools to help organize and monitor a mess
like this? We have a software team that keeps an eye on the applications,
but they do not appear to be monitoring things like power consumption,
temperature, or even tracking parts as they get re-used. Our server "map"
is literally just a Google Sheets document that was formatted to look like
server rows with IP addresses listed by physical location. And I'm pretty
sure everyone hates it. So I'm basically looking for tools to help me set
up the following infrastructure:

- server documentation. Type, hardware configuration, and parts compatibility
- temp monitoring. Many of the servers are running CUDA applications on
  Dual/Quad GPU systems. They get toasty.
- power consumption monitoring. Our PDUs are able to report usage via a
  network interface, but nobody ever bothered to set it up. Would be nice
  to have a dashboard that shows when one of the servers freaks out and
  trips the breaker.
Thoughts? Solutions? Apps? I'm just looking for ideas at the moment.
Everything is running (or so I'm told) but we currently have a bus number
of 1 which is obviously a recipe for disaster. I don't mind piecing
together my own set of scripts and utilities but if something already
exists that does the work for me, even better :)

Ben,

As you well know, this is not in my wheelhouse, but it's very similar to
environmental permit management for large corporations, so I offer a couple
of suggestions for you to ponder.

1. First, write an SOP (Standard Operating Procedure) that
encapsulates all needs of the data center and describes all hardware,
software, and personnel involved. This puts in place a standard that can be
used for new staff, hardware, and software.

2. Develop a database application with a web interface: PostgreSQL and Flask
or Django (the latter was initially developed for an online newspaper that's
constantly being updated). Since temperature and power monitoring are
continuous, the database will be updated in real time. There almost
certainly are tools that will automate measurements and insertion into the
appropriate database table.
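
To make the idea concrete, here is a minimal sketch of such a database. It
uses Python's built-in sqlite3 module as a stand-in for PostgreSQL so it runs
with no setup; the table layout and every name in it (servers, readings,
record_reading, servers_over) are hypothetical, not taken from any existing
tool. Something like a PDU poller or a GPU temperature script would call
record_reading(), and a Flask or Django view could call servers_over() to
feed a dashboard.

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical schema: one row per server, one row per sensor reading.
SCHEMA = """
CREATE TABLE IF NOT EXISTS servers (
    id            INTEGER PRIMARY KEY,
    hostname      TEXT NOT NULL UNIQUE,
    rack_location TEXT NOT NULL,
    hardware      TEXT
);
CREATE TABLE IF NOT EXISTS readings (
    id        INTEGER PRIMARY KEY,
    server_id INTEGER NOT NULL REFERENCES servers(id),
    taken_at  TEXT NOT NULL,
    metric    TEXT NOT NULL CHECK (metric IN ('temp_c', 'power_w')),
    value     REAL NOT NULL
);
"""


def open_db(path=":memory:"):
    """Open (or create) the database and make sure the tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_server(conn, hostname, rack_location, hardware=""):
    """Register a server and return its id."""
    cur = conn.execute(
        "INSERT INTO servers (hostname, rack_location, hardware) VALUES (?, ?, ?)",
        (hostname, rack_location, hardware),
    )
    conn.commit()
    return cur.lastrowid


def record_reading(conn, server_id, metric, value, taken_at=None):
    """Store one temperature ('temp_c') or power ('power_w') reading."""
    taken_at = taken_at or datetime.now(timezone.utc).isoformat()
    conn.execute(
        "INSERT INTO readings (server_id, taken_at, metric, value) VALUES (?, ?, ?, ?)",
        (server_id, taken_at, metric, value),
    )
    conn.commit()


def servers_over(conn, metric, limit):
    """Return (hostname, rack_location, value) for every server whose most
    recently inserted reading of `metric` is above `limit`."""
    return conn.execute(
        """
        SELECT s.hostname, s.rack_location, r.value
        FROM readings r
        JOIN servers s ON s.id = r.server_id
        WHERE r.metric = ?
          AND r.id = (SELECT MAX(id) FROM readings
                      WHERE server_id = r.server_id AND metric = r.metric)
          AND r.value > ?
        ORDER BY r.value DESC
        """,
        (metric, limit),
    ).fetchall()
```

Moving this to PostgreSQL would mostly mean swapping the connection call and
the placeholder style for whichever driver you pick. The servers table is
also where the hardware notes from your spreadsheet could live.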

The initial effort will pay for itself in a short time and make your life
(and that of others in the datacenter and company) much easier.

HTH,

Rich
