Hey all, I have a somewhat strange (or maybe not so strange) question regarding datacenter management at the hardware and software level. For some context: I have recently found myself in charge of on-site maintenance for a datacenter with 800+ servers. While the job itself is pretty simple as far as the RAID arrays and general hardware configuration are concerned, there has been some drama regarding past technicians who weren't actually keeping track of anything. So I have piles of parts that may or may not be good, servers that are completely undocumented, and a grotesque mismatch of labeling schemes for the various Ethernet/fiber cables and server types.
Does anyone here who works with SMB-scale datacenter environments have any tips or industry-standard strategies for wrangling this type of setup? Are there any good FOSS software tools to help organize and monitor a mess like this? We have a software team that keeps an eye on the applications, but they do not appear to be monitoring things like power consumption or temperature, or even tracking parts as they get re-used. Our server "map" is literally just a Google Sheets document that was formatted to look like server rows, with IP addresses listed by physical location. And I'm pretty sure everyone hates it.

So I'm basically looking for tools to help me set up the following infrastructure:

- Server documentation. Type, hardware configuration, and parts compatibility.
- Temp monitoring. Many of the servers are running CUDA applications on dual/quad-GPU systems. They get toasty.
- Power consumption monitoring. Our PDUs are able to report usage via a network interface, but nobody ever bothered to set it up. Would be nice to have a dashboard that shows when one of the servers freaks out and trips the breaker.

Thoughts? Solutions? Apps? I'm just looking for ideas at the moment. Everything is running (or so I'm told), but we currently have a bus number of 1, which is obviously a recipe for disaster. I don't mind piecing together my own set of scripts and utilities, but if something already exists that does the work for me, even better :)

-Ben
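
P.S. To make the list above a bit more concrete, here is a rough Python sketch of the kind of record I'd like to replace the spreadsheet rows with, plus the sort of threshold check I'd want a dashboard to run. Everything in it is made up for illustration: the field names, the hostnames, the 192.0.2.x addresses, the readings, and the 85 degree limit are all assumptions, not our actual setup, and the readings are hard-coded rather than pulled from real sensors or PDUs.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    """One inventory entry; all field names are hypothetical."""
    hostname: str
    row: str            # physical row label in the room
    rack: int           # rack number within the row
    ip: str
    model: str
    gpus: int = 0
    parts: list[str] = field(default_factory=list)  # installed/re-used parts


def over_limit(readings: dict[str, float], limit: float) -> list[str]:
    """Return the hostnames whose reading is strictly above limit, sorted."""
    return sorted(host for host, value in readings.items() if value > limit)


# Made-up example data, standing in for what a real collector would supply.
inventory = [
    Server("gpu-a01", row="A", rack=1, ip="192.0.2.10", model="quad-GPU box", gpus=4),
    Server("gpu-a02", row="A", rack=1, ip="192.0.2.11", model="dual-GPU box", gpus=2),
]
temps_c = {"gpu-a01": 91.0, "gpu-a02": 72.5}

hot = over_limit(temps_c, limit=85.0)
print(hot)
```

The same `over_limit` check would work for per-outlet power draw if the PDU numbers were fed in as the readings. I'd much rather use an existing tool than grow this into yet another homemade system, though.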
