Hi,

we are running a cluster that has been alive for a long time and we tread 
carefully regarding updates. We are still lagging a bit: our cluster (which 
started around Firefly) is currently at Nautilus. We are updating and we know 
we’re still behind, but we do keep running into challenges along the way that 
are typically still unfixed on main and - as I said at the start - require us 
to tread carefully.

Nevertheless, mistakes happen, and we found ourselves in this situation: we 
converted our RGW data pool from replicated (n=3) to erasure coded (k=10, m=3, 
with 17 hosts), but when selecting the EC profile we missed that our hosts are 
not evenly balanced (this is a growing cluster: some machines have around 
20TiB of capacity for the RGW data pool, whereas newer machines have around 
160TiB), and we should rather have gone with k=4, m=3. In any case, having 13 
chunks causes too many hosts to participate in each object. Going for k+m=7 
will allow distribution to be more effective, as we have 7 hosts with the 
160TiB sizing.
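
For illustration, this is roughly the shape of the profile/pool we would be 
aiming for now - all names below are placeholders, not our actual pool names:

    # Hypothetical target EC profile and data pool (names are examples only)
    ceph osd erasure-code-profile set rgw-ec-4-3 k=4 m=3 crush-failure-domain=host
    ceph osd pool create rgw.buckets.data.ec43 256 256 erasure rgw-ec-4-3
    ceph osd pool application enable rgw.buckets.data.ec43 rgw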

Our original migration used the “cache tiering” approach, but that only works 
once when moving from replicated to EC and cannot be used for further 
migrations: the cache pool in a tier has to be replicated, so our now-EC data 
pool cannot be layered as a cache on top of another EC pool.

At 215TiB, the amount of data is somewhat significant, so we need an approach 
that scales when copying data[1] to avoid ending up with months of migration.

I’ve run out of ideas for doing this at a low level (i.e. fixing it at the 
rados/pool level), and I guess we can only fix this at the application level 
using multi-zone replication.
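
For context, the rough shape of that setup as I understand it - zone, pool, 
endpoint and key names below are just placeholders:

    # Secondary zone whose default placement points at the new EC data pool;
    # all names, endpoints and keys here are made up for illustration.
    radosgw-admin zone create --rgw-zonegroup=default --rgw-zone=ec-zone \
        --endpoints=http://rgw-ec.example.com:7480 \
        --access-key=SYSTEMKEY --secret=SYSTEMSECRET
    radosgw-admin zone placement modify --rgw-zone=ec-zone \
        --placement-id=default-placement \
        --data-pool=ec-zone.rgw.buckets.data
    radosgw-admin period update --commit
    # ...then let the metadata/data sync run and keep an eye on it:
    radosgw-admin sync status --rgw-zone=ec-zone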

I have the setup nailed in general, but I’m running into issues with buckets in 
our staging and production environments that have `explicit_placement` pools 
attached. AFAICT this is an outdated mechanism, but there are no migration 
tools around. I’ve seen some people talk about patched versions of 
`radosgw-admin`, since the stock `metadata put` (still) prohibits removing 
explicit placements.
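
For reference, this is roughly what such a bucket instance looks like when 
inspecting its metadata - bucket name, instance ID and pool names below are 
examples, not our actual values:

    # Inspect a bucket instance (placeholder key):
    radosgw-admin metadata get bucket.instance:mybucket:<instance-id>

    # Abbreviated excerpt of the returned JSON:
    #   "bucket_info": {
    #     "bucket": {
    #       "name": "mybucket",
    #       ...
    #       "explicit_placement": {
    #         "data_pool": "default.rgw.buckets.data",
    #         "data_extra_pool": "default.rgw.buckets.non-ec",
    #         "index_pool": "default.rgw.buckets.index"
    #       }
    #     },
    #     "placement_rule": "default-placement",
    #     ...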

AFAICT those explicit placements will be synced to the secondary zone and the 
effect that I’m seeing underpins that theory: the sync runs for a while and 
only a few hundred objects show up in the new zone, as the buckets/objects are 
already found in the old pool that the new zone uses due to the explicit 
placement rule.

I’m currently running out of ideas, but open for any other options.

Looking at 
https://lists.ceph.io/hyperkitty/list/[email protected]/thread/ULKK5RU2VXLFXNUJMZBMUG7CQ5UCWJCB/#R6CPZ2TEWRFL2JJWP7TT5GX7DPSV5S7Z
 I’m wondering whether the relevant patch is available somewhere, or whether 
I’ll have to try building that patch again on my own.

Going through the docs and the code, I’m wondering whether `explicit_placement` 
is really just a crufty residual mechanism that won’t get used in newer 
clusters, but that older clusters don’t really have an option to get away 
from?

In my specific case, the placement rules are identical to the explicit 
placements that are stored on (apparently older) buckets, and the only thing I 
need to do is to remove them. I can accept a bit of downtime to avoid any race 
conditions if needed, so maybe having a small tool to just remove those entries 
while all RGWs are down would be fine. A call to `radosgw-admin bucket stats` 
takes about 18s for all buckets in production, and I guess that would be a good 
baseline for what timing to expect when running an update on the metadata.
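
In case it helps the discussion, this is the kind of loop I have in mind - a 
rough sketch only, assuming a radosgw-admin build that actually accepts the 
change (per the patched variants mentioned above) and that all RGWs are 
stopped; keys and paths are placeholders:

    # Blank out explicit_placement on every bucket instance (sketch, untested).
    for key in $(radosgw-admin metadata list bucket.instance | jq -r '.[]'); do
        radosgw-admin metadata get "bucket.instance:${key}" > /tmp/bi.json
        jq '.data.bucket_info.bucket.explicit_placement = {
              "data_pool": "", "data_extra_pool": "", "index_pool": ""}' \
            /tmp/bi.json > /tmp/bi.patched.json
        # Stock radosgw-admin is expected to refuse/revert this step:
        radosgw-admin metadata put "bucket.instance:${key}" < /tmp/bi.patched.json
    done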

I’ll also be in touch with colleagues from Heinlein and 42on but I’m open to 
other suggestions.

Hugs,
Christian

[1] We currently have 215TiB of data in 230M objects. Using the “official” 
“cache-flush-evict-all” approach was infeasible here as it only yielded around 
50MiB/s. Using cache limits and targeting the cache sizes towards 0 caused 
proper parallelization and was able to flush/evict at an almost constant 
1GiB/s in the cluster.
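
For anyone hitting the same wall: `cache-flush-evict-all` is driven by a single 
rados client, whereas lowering the cache targets lets the tiering agents on the 
OSDs do the work in parallel. A sketch of the kind of settings I mean - the 
pool name is a placeholder, and I'm assuming the ratio knobs are what counts as 
"cache limits" here:

    # Drive the OSD tiering agents to flush/evict everything (sketch):
    ceph osd pool set rgw.data.cache cache_target_dirty_ratio 0.0
    ceph osd pool set rgw.data.cache cache_target_full_ratio 0.0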


-- 
Christian Theune · [email protected] · +49 345 219401 0
Flying Circus Internet Operations GmbH · https://flyingcircus.io
Leipziger Str. 70/71 · 06108 Halle (Saale) · Deutschland
HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick