Hi, is there an easy way to stop the jsub "failed to redirect job output" error messages I receive in my mailbox during the NFS outage, ideally for all of the ~50 jobs I have scheduled for my tools?
Unfortunately, I currently receive dozens of mails every hour.

Martin

On Mon, Apr 3, 2023, 1:24 AM Andrew Bogott <abog...@wikimedia.org> wrote:

> Reminder: The first of these outages will start in about 30 minutes.
> Toolforge NFS will be read-only for as long as 18-19 hours.
>
> On 3/29/23 2:17 PM, Andrew Bogott wrote:
>
> There will be two major Toolforge outages this coming week. Each outage
> will cause tool downtime and may require manual restarts afterwards.
>
> The first outage is an NFS migration [0] and will take place on Monday,
> beginning at around 0:00 UTC and lasting well into the day, possibly as
> late as 19:00 UTC. During this long period, Toolforge NFS will be
> read-only. This will cause most tools (for example, anything that writes a
> log file) to fail.
>
> The second outage will be a database migration [1] and will take place on
> Thursday at 17:00UTC. During this window ToolsDB will be read-only. This
> migration should take about an hour but unexpected side-effects may extend
> the downtime.
>
> We try very hard to avoid outages of this magnitude, but at this point we
> need to choose downtime over the increasing risk of data loss.
>
> More details can be found below.
>
>
> [0] NFS Outage and system reboots Monday: The existing toolforge NFS
> server is running on aging hardware and lacks a straightforward path for
> maintenance or upgrading. To improve this we are moving NFS to a cinder+VM
> platform which should support easier upgrades, migrations, and expansions
> in the future. In order to maintain data integrity during the migration,
> the old server will need to be made read-only while the last set of file
> changes is synchronized with the new server. Because the NFS service is
> actively used, it will take many hours to complete the final sync.
>
> To ensure stable mounts of the new server, every node in Toolforge will
> be rebooted as part of this migration. That means that even tools which do
> not use NFS will be affected, although most tools should restart gracefully.
>
> This task is documented as https://phabricator.wikimedia.org/T333477.
>
>
> [1] DB outage Thursday: As part of the ongoing effort to upgrade
> user-created Toolforge databases, we will migrate ToolsDB to a new VM
> that will have a more recent version of Debian and MariaDB and will use a
> more resilient storage solution.
>
> The new VM is ready, and we plan to point all tools to use it on *Apr, 6
> 2023 at 17:00 UTC*.
>
> This will involve about *1 hour of read-only time* for the database. Any
> existing database connection will be terminated, and if your tool does not
> reconnect automatically you might have to restart it manually.
>
> An email will be sent shortly before starting the migration, and when it's
> finished.
>
> Please also make sure your tool is connecting to the database using the
> canonical hostname *tools.db.svc.wikimedia.cloud* and not any other
> hostname or IP address.
>
> For more details, and to report any issue, you can read or leave a comment
> at https://phabricator.wikimedia.org/T333471
>
> For more context you can also check out the parent task
> https://phabricator.wikimedia.org/T301949
>
>
> _______________________________________________
> Cloud-announce mailing list -- cloud-annou...@lists.wikimedia.org
> List information:
> https://lists.wikimedia.org/postorius/lists/cloud-announce.lists.wikimedia.org/
> _______________________________________________
> Cloud mailing list -- cloud@lists.wikimedia.org
> List information:
> https://lists.wikimedia.org/postorius/lists/cloud.lists.wikimedia.org/
>