On Fri, 13 Jan 2023 at 17:42, Loris Bennett <loris.benn...@fu-berlin.de> wrote:
> What do you do if the monitoring does detect that the changelog is
> becoming too large?

Hi Loris (also, nice to know about LU-14699, thanks Colin),

We also monitor this. It gets pulled into influxdb via a hacktastic bit
of python subprocess (lctl get_param mdd.*.changelog_users), and I just
use a grafana alert that triggers when the backlog hits 5M. That's an
arbitrary value, but it gives us an "oh, I'd better go and look" notice.
Having the rate of change plotted too gives us an idea of whether it's
transient (users doing a chunk of IO) or just growing linearly
(robinhood stopped responding).

Given it only takes us 2-3 hours to scan the filesystem, if something's
really out of sync then stop robinhood, drop database, create database,
lctl changelog_clear ... ... 0 and restart robinhood (twice: once to
start the reader and once for a one-shot scan) is often quicker than
poking around. We purely use it for purging /scratch and usage reports
on everything else, so there's no concern about maintaining HSM state
or other critical things.

I think we've only had to do the drastic step of nuking the changelog
info from the MDT (mounted as ldiskfs) twice, and that's been on our
(very soon to be decommissioned) 2.5.x filesystem. Both times were when
the changelogs had wrapped *and* there had been an MDT failover.

Andrew

_______________________________________________
robinhood-support mailing list
robinhood-support@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/robinhood-support
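
For illustration only, here is a minimal sketch of the kind of check
described in the message above. It is not the actual collector script.
It assumes the changelog_users output has a "current index:" (or
"current_index:") line followed by one "clN <index>" row per registered
reader, which may differ between Lustre versions, and it assumes the
backlog is the current index minus the reader's index. The function
names and BACKLOG_ALERT are hypothetical; only the lctl get_param call
and the 5M threshold come from the message.

```python
import re
import subprocess

# Arbitrary alert threshold mentioned in the message (5M records).
BACKLOG_ALERT = 5_000_000


def read_changelog_users():
    """Run the lctl query from the message and return its raw text."""
    return subprocess.run(
        ["lctl", "get_param", "mdd.*.changelog_users"],
        capture_output=True, text=True, check=True,
    ).stdout


def parse_changelog_users(text):
    """Return {(mdt, reader_id): backlog} from changelog_users output.

    Assumed layout (not verified against every Lustre version):
        mdd.<fsname>-MDT0000.changelog_users=
        current index: 123456
        ID    index (idle seconds)
        cl1   120000 (15)
    """
    backlogs = {}
    mdt = None
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        m = re.match(r"mdd\.(\S+)\.changelog_users=", line)
        if m:
            # New MDT section: forget the previous current index.
            mdt, current = m.group(1), None
            continue
        m = re.match(r"current[ _]index:\s*(\d+)", line)
        if m:
            current = int(m.group(1))
            continue
        m = re.match(r"(cl\d+\S*)\s+(\d+)", line)
        if m and current is not None:
            # Assumption: backlog = records the reader has not consumed yet.
            backlogs[(mdt, m.group(1))] = current - int(m.group(2))
    return backlogs


def readers_over_threshold(backlogs, limit=BACKLOG_ALERT):
    """Return the sorted (mdt, reader_id) keys whose backlog is >= limit."""
    return sorted(key for key, lag in backlogs.items() if lag >= limit)
```

On an MDS this could be used as
readers_over_threshold(parse_changelog_users(read_changelog_users())),
with the per-reader backlog values written to the time-series database
so that both the level and the rate of change can be plotted.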