Hello,

Today we disabled BigBrother in Toolforge. BigBrother was a tool that monitored continuous jobs and restarted those that Grid Engine failed to restart on its own, because they hit corner cases it wasn't smart enough to handle (e.g. running out of memory). In effect, BigBrother duplicated Grid Engine's restart functionality on a layer above it.

Although very few tools used BigBrother (0.65% to be more precise), it taxed our NFS file server constantly, so keeping it around didn't make much sense. Additionally, its functionality can easily be implemented with a shell script run from cron.

So we've converted all tools that had a .bigbrotherrc file to use a bigbrother.sh script that is triggered every 5 minutes to restart jobs. If your tool used BigBrother, please check your crontab (`crontab -l`); you should see a few entries like this:

```
# Ensure continuous jobs are running
*/5 * * * * jlocal /data/project/tool_name/bigbrother.sh job_name job_script
```
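
For anyone curious what a check-and-restart script of this kind might look like, here is a minimal sketch. It is not the actual bigbrother.sh that was deployed; the function name `ensure_job_running` and the `CHECK_CMD`/`START_CMD` variables are made up for illustration, and it assumes that `qstat -j NAME` exits non-zero when Grid Engine knows of no job by that name and that `jstart -N NAME SCRIPT` submits a continuous job.

```shell
#!/bin/sh
# Hypothetical sketch only, NOT the bigbrother.sh installed in tool accounts.
#
# Assumptions (overridable, so the logic can be tried outside the grid):
#   CHECK_CMD: exits 0 if the named job is queued or running
#   START_CMD: submits the job under the given name
: "${CHECK_CMD:=qstat -j}"
: "${START_CMD:=jstart -N}"

ensure_job_running() {
    job_name=${1:-}
    job_script=${2:-}

    if [ -z "$job_name" ] || [ -z "$job_script" ]; then
        echo "usage: ensure_job_running job_name job_script" >&2
        return 2
    fi

    # The scheduler already knows about the job: nothing to do.
    # $CHECK_CMD is left unquoted on purpose so its options are split.
    if $CHECK_CMD "$job_name" >/dev/null 2>&1; then
        return 0
    fi

    # Otherwise log the restart to stderr and resubmit the job.
    echo "$(date) restarting $job_name" >&2
    $START_CMD "$job_name" "$job_script"
}
```

Saved as a script ending with `ensure_job_running "$@"`, something like this could be called from a crontab entry of the same shape as the one above, with the job name and job script as its two arguments.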

Documentation has also been updated to reflect this change: https://wikitech.wikimedia.org/wiki/Help:Toolforge/Grid#Bigbrother_(Deprecated)

Everything worked fine in our tests, but please let us know if your tool is affected by this change.

Regards,

--
Giovanni Tirloni
Operations Engineer
Wikimedia Cloud Services

_______________________________________________
Wikimedia Cloud Services announce mailing list
[email protected] (formerly [email protected])
https://lists.wikimedia.org/mailman/listinfo/cloud-announce
_______________________________________________
Wikimedia Cloud Services mailing list
[email protected] (formerly [email protected])
https://lists.wikimedia.org/mailman/listinfo/cloud
