Hi everyone!

TL;DR: The service for VMs and anything running on them (e.g. toolforge, quarry, paws, ...) is currently degraded. You might be able to use the services, or they might be too slow to use. We are working on it and will update when it is fixed.

Long story: We were adding a new ceph node to the ceph cluster. This time the node was in a different subnet, but ceph is supposed to work transparently across multiple subnets. For some reason the new node was added to the cluster, but it is failing to reply to the heartbeats sent from the other nodes in the cluster. That causes the cluster to keep rebalancing data around, which creates continuous IO slowness for any clients (like VMs).

We are trying to minimize the impact by limiting the amount of data that gets re-shuffled. That slows down the intervention a bit, but should improve the client experience.

We are actively working on this, and will update with any changes.

Cheers!

--
David Caro
SRE - Cloud Services
Wikimedia Foundation <https://wikimediafoundation.org/>
PGP Signature: 7180 83A2 AC8B 314F B4CE 1171 4071 C7E1 D262 69C3

"Imagine a world in which every single human being can freely share in the sum of all knowledge. That's our commitment."

_______________________________________________ Cloud-announce mailing list -- [email protected] List information: https://lists.wikimedia.org/postorius/lists/cloud-announce.lists.wikimedia.org/
_______________________________________________ Cloud mailing list -- [email protected] List information: https://lists.wikimedia.org/postorius/lists/cloud.lists.wikimedia.org/
