Hey all,

I have a somewhat strange (or maybe not so strange) question regarding 
datacenter management at the hardware and software level. For some context: I 
have recently found myself in charge of on-site maintenance for a datacenter 
with 800+ servers. While the job itself is pretty simple as far as the RAID 
arrays and general hardware configuration are concerned, there has been some 
drama regarding past technicians who weren't actually keeping track of 
anything. So I have piles of parts that may or may not be good, servers that 
are completely undocumented, and a grotesque mismatch of labeling schemes for 
the various ethernet/fiber cables and server types.

Does anyone here who works with SMB-scale datacenter environments have any tips 
or industry standard strategies for wrangling this type of setup? Are there any 
good FOSS software tools to help organize and monitor a mess like this? We have 
a software team that keeps an eye on the applications, but they do not appear 
to be monitoring things like power consumption, temperature, or even tracking 
parts as they get re-used. Our server "map" is literally just a Google Sheets 
document that was formatted to look like server rows with IP addresses listed 
by physical location. And I'm pretty sure everyone hates it. So I'm basically 
looking for tools to help me set up the following infrastructure:

- server documentation. Type, hardware configuration, and parts compatibility
- temp monitoring. Many of the servers are running CUDA applications on 
Dual/Quad GPU systems. They get toasty.
- power consumption monitoring. Our PDUs are able to report usage via a network 
interface, but nobody ever bothered to set it up. Would be nice to have a 
dashboard that shows when one of the servers freaks out and trips the breaker.
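
To make the temp monitoring item concrete, this is roughly the kind of script 
I could piece together myself. It is only a minimal sketch, assuming 
nvidia-smi is installed on each GPU host; the function names and the 85 C 
limit are placeholders I made up, not anything we actually run today.

```python
import csv
import io
import subprocess

# nvidia-smi query mode; the choice of fields here is mine.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,name,temperature.gpu,power.draw",
    "--format=csv,noheader,nounits",
]


def _to_float(value):
    """Return a float, or None for values like "[N/A]"."""
    try:
        return float(value)
    except ValueError:
        return None


def parse_gpu_report(text):
    """Turn nvidia-smi CSV output into a list of per-GPU dicts."""
    rows = []
    for fields in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        index, name, temp, power = (f.strip() for f in fields)
        rows.append({
            "index": int(index),
            "name": name,
            "temp_c": _to_float(temp),
            "power_w": _to_float(power),
        })
    return rows


def hot_gpus(rows, limit_c=85.0):
    """Return the GPUs at or above limit_c (hypothetical threshold)."""
    return [r for r in rows if r["temp_c"] is not None and r["temp_c"] >= limit_c]


def read_local_gpus():
    """Query the GPUs on this host; requires nvidia-smi on the PATH."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_report(result.stdout)
```

Running hot_gpus(read_local_gpus()) from cron on each box and shipping the 
result somewhere central would be the crude version of what I am after, 
which is why I would rather use an existing tool if one fits.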

Thoughts? Solutions? Apps? I'm just looking for ideas at the moment. Everything 
is running (or so I'm told), but we currently have a bus factor of 1, which is 
obviously a recipe for disaster. I don't mind piecing together my own set of 
scripts and utilities but if something already exists that does the work for 
me, even better :)
-Ben
