Based on what you have mentioned in the other threads, I'd say go for 
something like Zabbix (originally suggested by MC).
This gets you:
1) Monitoring with agent and SNMP (along with alerting etc.)
        - This gets you the power, temp and network monitoring
2) Inventory including for networks
        - While Zabbix does have automated inventory, you will have to populate 
the rack charts.
3) Dashboards
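
As a rough illustration of point 2, here is a minimal Python sketch of how 
inventory data could be pulled out of Zabbix to help fill in rack charts. It 
only builds the JSON-RPC request body for the API's host.get method and 
flattens a reply into rows sorted by location. The function names, the choice 
of inventory fields and the sample reply are my own assumptions for 
illustration, not something from an existing setup. Sending the request (an 
HTTP POST to the frontend's api_jsonrpc.php) and authentication are left out, 
since the auth mechanism differs between Zabbix versions.

```python
import json


def build_host_inventory_request(request_id=1):
    """Build a JSON-RPC 2.0 body for the Zabbix host.get method.

    The inventory fields requested here are an assumption about what
    a rack chart needs; adjust them to whatever you actually populate.
    """
    return {
        "jsonrpc": "2.0",
        "method": "host.get",
        "params": {
            "output": ["hostid", "host"],
            "selectInventory": ["location", "model", "serialno_a"],
        },
        "id": request_id,
    }


def rack_rows(response):
    """Flatten a host.get reply into (location, host, model, serial) rows."""
    rows = []
    for host in response.get("result", []):
        # Hosts without inventory may come back with an empty list
        # instead of an object, so fall back to an empty dict.
        inv = host.get("inventory") or {}
        rows.append((
            inv.get("location") or "unknown",
            host["host"],
            inv.get("model", ""),
            inv.get("serialno_a", ""),
        ))
    return sorted(rows)


# Hypothetical reply, hand-written for this example (not real output).
SAMPLE_REPLY = {
    "jsonrpc": "2.0",
    "result": [
        {"hostid": "10101", "host": "old-box", "inventory": []},
        {"hostid": "10102", "host": "gpu-node-01",
         "inventory": {"location": "Row A / Rack 3",
                       "model": "X", "serialno_a": "S1"}},
    ],
    "id": 1,
}

if __name__ == "__main__":
    print(json.dumps(build_host_inventory_request(), indent=2))
    for row in rack_rows(SAMPLE_REPLY):
        print(row)
```

Hosts that show up under "unknown" are the ones whose location still has to 
be entered by hand, which is the manual part mentioned above.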

I would absolutely dissuade folks from rolling their own. We've done 
something like this before, just for integration into our other 
infrastructure.
-Eldo

On 3/1/24 21:36, Ben Koenig wrote:
> Hey all,
> 
> I have a somewhat strange (or maybe not so strange) question regarding 
> datacenter management at the hardware and software level. For some context: I 
> have recently found myself in charge of on-site maintenance for a datacenter 
> with 800+ servers. While the job itself is pretty simple as far as the RAID 
> arrays and general hardware configuration is concerned there has been some 
> drama regarding past technicians who weren't actually keeping track of 
> anything. So I have piles of parts that may or may not be good, servers that 
> are completely undocumented, and a grotesque mismatch of labeling schemes for 
> the various ethernet/fiber cables and server types.
> 
> Does anyone here who works with SMB scale datacenter environments have any 
> tips or industry standard strategies for wrangling this type of setup? Are 
> there any good FOSS software tools to help organize and monitor a mess like 
> this? We have a software team that keeps an eye on the applications, but 
> they do not appear to be monitoring things like power consumption, 
> temperature, or even tracking parts as they get re-used. Our server "map" is 
> literally just a Google Sheets document that was formatted to look like 
> server rows with IP addresses listed by physical location. And I'm pretty 
> sure everyone hates it. So I'm basically looking for tools to help me set up 
> the following infrastructure:
> 
> - server documentation. Type, hardware configuration, and parts compatibility
> - temp monitoring. Many of the servers are running CUDA applications on 
> Dual/Quad GPU systems. They get toasty.
> - power consumption monitoring. Our PDUs are able to report usage via a 
> network interface, but nobody ever bothered to set it up. Would be nice to 
> have a dashboard that shows when one of the servers freaks out and trips the 
> breaker.
> 
> Thoughts? Solutions? Apps? I'm just looking for ideas at the moment. 
> Everything is running (or so I'm told) but we currently have a bus number of 
> 1 which is obviously a recipe for disaster. I don't mind piecing together my 
> own set of scripts and utilities but if something already exists that does 
> the work for me, even better :)
> -Ben
