On Fri, 13 Jan 2023 at 17:42, Loris Bennett <loris.benn...@fu-berlin.de> wrote:

> What do you do if the monitoring does detect that the changelog is
> becoming too large?


Hi Loris (also, nice to know about LU-14699, thanks Colin),

We also monitor this. The output of "lctl get_param
mdd.*.changelog_users" gets pulled into influxdb via a hacktastic bit
of python subprocess, and I just use a grafana alert that triggers
when the backlog hits 5M. That's an arbitrary value, but it gives us
an "oh, I'd better go and look" notice. Having the rate of change
plotted too gives us an idea of whether it's transient (users doing a
chunk of IO) or just growing linearly (robinhood stopped responding).
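
For anyone wanting to do something similar, here is a rough sketch of
the parsing side in python. It is not our actual script. It assumes
the changelog_users output has a "current index: N" line followed by
one "clN  index" line per registered reader (newer Lustre versions
append an idle time in brackets, which is ignored here). The function
names, the sample MDT name and the way the 5M threshold is applied are
all made up for illustration, and the influxdb write is left out.

```python
import re
import subprocess

ALERT_BACKLOG = 5_000_000  # arbitrary "go and look" threshold from this mail


def read_changelog_users():
    """Run lctl on the MDS and return its raw text output."""
    proc = subprocess.run(
        ["lctl", "get_param", "mdd.*.changelog_users"],
        capture_output=True, text=True, check=True,
    )
    return proc.stdout


def parse_changelog_users(text):
    """Parse lctl output into {mdt: {"current": int, "users": {id: index}}}.

    Assumed layout per MDT (check against your Lustre version):
        mdd.<fsname>-MDT0000.changelog_users=
        current index: 6200000
        ID    index (idle seconds)
        cl1   1000000 (12)
    """
    result = {}
    mdt = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        m = re.match(r"mdd\.(.+)\.changelog_users=", line)
        if m:
            mdt = m.group(1)
            result[mdt] = {"current": None, "users": {}}
            continue
        if mdt is None:
            continue
        m = re.match(r"current index:\s*(\d+)", line)
        if m:
            result[mdt]["current"] = int(m.group(1))
            continue
        m = re.match(r"(cl\d+\S*)\s+(\d+)", line)
        if m:
            result[mdt]["users"][m.group(1)] = int(m.group(2))
    return result


def backlog(parsed):
    """Return {(mdt, user): records not yet cleared by that reader}."""
    out = {}
    for mdt, info in parsed.items():
        if info["current"] is None:
            continue
        for user, index in info["users"].items():
            out[(mdt, user)] = info["current"] - index
    return out


def over_threshold(backlogs, limit=ALERT_BACKLOG):
    """List the (mdt, user) pairs whose backlog has reached the limit."""
    return sorted(key for key, value in backlogs.items() if value >= limit)


# Hypothetical sample output, used only to exercise the parser.
SAMPLE = """\
mdd.scratch-MDT0000.changelog_users=
current index: 6200000
ID    index (idle seconds)
cl1   1000000 (12)
"""
```

On a real MDS you would feed read_changelog_users() into
parse_changelog_users() instead of SAMPLE, and push the backlog()
values to whatever does your plotting and alerting.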

Given it only takes us 2-3 hours to scan the filesystem, if
something's really out of sync then stopping robinhood, dropping and
recreating the database, running lctl changelog clear ... ... 0 and
restarting robinhood (twice: once to start the reader and once for a
one-shot scan) is often quicker than poking around. We use it purely
for purging /scratch and for usage reports on everything else, so
there's no concern about maintaining HSM state or other critical
things.

I think we've only had to take the drastic step of nuking the
changelog info from the MDT (mounted as ldiskfs) twice, and that's
been on our (very soon to be decommissioned) 2.5.x filesystem. Both
times were when the changelogs wrapped *and* we had an MDT failover.

Andrew


_______________________________________________
robinhood-support mailing list
robinhood-support@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/robinhood-support
