Hi,
Is there an easy way to stop the jsub "failed to redirect job output" error
messages that I am receiving in my mailbox during the NFS outage, ideally for
all of the ~50 jobs I have scheduled for my tools?
Unfortunately, I currently receive dozens of mails every hour.
Martin
On Mon, Apr 3, 2023, 1:24 AM Andrew Bogott wrote:
> Reminder: The first of these outages will start in about 30 minutes.
> Toolforge NFS will be read-only for as long as 18-19 hours.
>
>
>
> On 3/29/23 2:17 PM, Andrew Bogott wrote:
>
> There will be two major Toolforge outages this coming week. Each outage
> will cause tool downtime and may require manual restarts afterwards.
>
> The first outage is an NFS migration [0] and will take place on Monday,
> beginning at around 0:00 UTC and lasting well into the day, possibly as
> late as 19:00 UTC. During this long period, Toolforge NFS will be
> read-only. This will cause most tools (for example, anything that writes a
> log file) to fail.
>
> The second outage will be a database migration [1] and will take place on
> Thursday at 17:00 UTC. During this window ToolsDB will be read-only. This
> migration should take about an hour but unexpected side-effects may extend
> the downtime.
>
> We try very hard to avoid outages of this magnitude, but at this point we
> need to choose downtime over the increasing risk of data loss.
>
> More details can be found below.
>
>
> [0] NFS Outage and system reboots Monday: The existing Toolforge NFS
> server is running on aging hardware and lacks a straightforward path for
> maintenance or upgrading. To improve this we are moving NFS to a cinder+VM
> platform which should support easier upgrades, migrations, and expansions
> in the future. In order to maintain data integrity during the migration,
> the old server will need to be made read-only while the last set of file
> changes is synchronized with the new server. Because the NFS service is
> actively used, it will take many hours to complete the final sync.
>
> To ensure stable mounts of the new server, every node in Toolforge will
> be rebooted as part of this migration. That means that even tools which do
> not use NFS will be affected, although most tools should restart gracefully.
>
> This task is documented as https://phabricator.wikimedia.org/T333477.
>
>
> [1] DB outage Thursday: As part of the ongoing effort to upgrade
> user-created Toolforge databases, we will migrate ToolsDB to a new VM
> that will have a more recent version of Debian and MariaDB and will use a
> more resilient storage solution.
>
> The new VM is ready, and we plan to point all tools to use it on *Apr 6,
> 2023 at 17:00 UTC*.
>
> This will involve about *1 hour of read-only time* for the database. Any
> existing database connection will be terminated, and if your tool does not
> reconnect automatically you might have to restart it manually.
>
> An email will be sent shortly before starting the migration, and when it's
> finished.
>
> Please also make sure your tool is connecting to the database using the
> canonical hostname *tools.db.svc.wikimedia.cloud* and not any other
> hostname or IP address.
>
> For more details, and to report any issue, you can read or leave a comment
> at https://phabricator.wikimedia.org/T333471
>
> For more context you can also check out the parent task
> https://phabricator.wikimedia.org/T301949
>
>
> ___
> Cloud-announce mailing list -- cloud-annou...@lists.wikimedia.org
> List information:
> https://lists.wikimedia.org/postorius/lists/cloud-announce.lists.wikimedia.org/
> ___
> Cloud mailing list -- cloud@lists.wikimedia.org
> List information:
> https://lists.wikimedia.org/postorius/lists/cloud.lists.wikimedia.org/
>