Hi all,

I just packaged up the AWS SSM agent [1], which is a cool system for
automated management of fleets of machines both in AWS and outside of it,
allowing you to run commands on all of them, check "inventory" across all
of them automatically, set policies on disparate types of machines, and so on.

The agent seems to work fine on NixOS: I can run commands on a machine, and
by injecting a fake lsb_release into the agent's path I can keep an eye on
the current NixOS release. But one of the features of SSM is the ability to take an inventory of
"installed" packages on a system. Of course, that notion doesn't directly
make sense in NixOS, but it got me wondering what sorts of metrics might
make sense from a "keep an eye on your fleet of NixOS systems" perspective.

Some possibilities:

   1. Track runtime dependencies of the system root, and ideally maintain
   an external mapping of all of those hashes to expressions that produce
   them. The first part I know how to do, but the second part seems tricky.
   2. Monitor "GC state" of your NixOS system: count how many unreferenced
   derivations are in the store and how much disk space past system
   generations retain (factoring in hard linking and such).
   3. Dump current systemd unit state (broader than just NixOS, obviously).
   4. Track total time spent building derivations and downloading
   substitutes: could be helpful to understand that some of your machines
   aren't accessing your binary cache properly. Perhaps also a "binary cache
   hit rate" metric.
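
A minimal sketch of the first half of possibility 1 and a rough version of
possibility 2, assuming the machine has nix-store on its PATH and that
/run/current-system is the system root to inspect. The function names
(parse_store_paths, run_nix_store, collect_inventory) are made up for
illustration and are not part of Nix or the SSM agent. It only counts dead
store paths; it does not try to account for disk space or hard linking, and
it does not attempt the hash-to-expression mapping.

```python
import subprocess

STORE_PREFIX = "/nix/store/"


def parse_store_paths(output):
    """Split nix-store output (one store path per line) into (hash, name) pairs."""
    entries = []
    for line in output.splitlines():
        line = line.strip()
        if not line.startswith(STORE_PREFIX):
            continue  # skip blank lines and anything that is not a store path
        base = line[len(STORE_PREFIX):]
        digest, sep, name = base.partition("-")
        if sep:
            entries.append((digest, name))
    return entries


def run_nix_store(*args):
    """Run nix-store with the given arguments and return its stdout as text."""
    result = subprocess.run(
        ["nix-store", *args], check=True, capture_output=True, text=True
    )
    return result.stdout


def collect_inventory(system_root="/run/current-system"):
    """Hypothetical helper: runtime closure of the system plus a dead-path count."""
    # Possibility 1 (first part): runtime dependencies of the system root.
    closure = parse_store_paths(
        run_nix_store("--query", "--requisites", system_root)
    )
    # Possibility 2 (rough): store paths the garbage collector would delete.
    dead = parse_store_paths(run_nix_store("--gc", "--print-dead"))
    return {"closure": closure, "dead_path_count": len(dead)}
```

Calling collect_inventory() on a NixOS machine should return the closure as
a list of (hash, name) pairs, which could then be reshaped into whatever
format the inventory system expects.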

Does anyone have others? If you manage a large fleet of NixOS machines (and
possibly other types of OSes too, so NixOps might not be suitable), which
metrics do you find useful? Even if you do use NixOps to manage the state
of your machines, ongoing metrics can still be useful for assessing the
health of your systems. You don't want to be surprised by a machine's drive
filling up because its store is full of junk :)


[1] https://aws.amazon.com/ec2/systems-manager/
nix-dev mailing list