Hi everyone,

I've previously posted about this in Slack, but it was suggested that I
bring it to the mailing list, so here I am. The original Slack post: [1].

I've recently upgraded a small Squid cluster to Tentacle (from 19.2.3
straight to 20.2.1).

The cluster has 3 nodes (Debian Bookworm, 2x EPYC 7371, 256 GB memory)
and is primarily used for RBD (VM images) and RGW/S3. There are 43
OSDs, 31 Intel SATA SSDs (10 per node, but the first node has 1 extra)
and 12 Seagate SAS SSDs (4 per node). The SATA SSDs have device class
"ssd" with ~50% average utilization, the SAS SSDs have device class
"sas-ssd" with ~15% average utilization and serve different pools. RGW
only utilizes the ssd class.

The cluster has been continuously upgraded from (I think) Mimic or
Nautilus. It was originally deployed using ceph-deploy and switched to
cephadm as soon as that was stable.

Under Squid the cluster ran at around 1-3% CPU most of the time, since
the applications aren't very IO heavy. During off-hours we usually see
only about 5-8k IOPS.

With the upgrade to Tentacle the nodes jumped to about 60% CPU
utilization and 150k IOPS (Hosts Overall Performance dashboard).
Meanwhile, ceph status and all the metrics I could find in
Grafana/Prometheus don't report increased throughput, IOPS or even
latency. Checking on the nodes directly (e.g. with htop's IO view), all
the OSDs in the ssd class consistently generate quite a bit of
throughput.

The iops are generated by the OSDs, the CPU load is generated both by
the OSDs and RGW.

RGW is usually deployed with 3 instances (1 per node), but the issue
also appears when running with just 1. It is currently single-site, but
I have toyed around with multi-site a while back on Squid.

Increasing the logging (debug_rgw 20/5 & debug_ms 1/5) showed that RGW
often attempted to get data_log.* objects but failed to do so. The
.rgw.log pool only contained a data_log.0 object. Blindly creating
empty data_log.{1..127} objects didn't silence the log line either.
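
For anyone who wants to run the same check on their own cluster, here is
a minimal Python sketch that reports which data_log shards are absent
from a pool listing. The helper name is hypothetical, and the shard
count of 128 is an assumption taken from the data_log.{1..127} range
above.

```python
NUM_SHARDS = 128  # assumption: shards are named data_log.0 .. data_log.127


def missing_data_log_shards(object_names, num_shards=NUM_SHARDS):
    """Return the data_log.<N> names that do not appear in object_names."""
    # Strip whitespace so lines read straight from a listing also match.
    present = {name.strip() for name in object_names}
    return [
        f"data_log.{i}"
        for i in range(num_shards)
        if f"data_log.{i}" not in present
    ]
```

The input can be the lines printed by `rados -p .rgw.log ls`. With only
data_log.0 present, the function returns the other 127 names.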

An example log:

osd.2 v2:_._._._:6843/2369122728 33152999 ==== osd_op_reply(286992957
data_log.20 [call] v109598'11820244296 uv0 ondisk = -61 ((61) No data
available)) ==== 155+0+0 (crc 0 0 0) 0x55e65cff3400 con 0x55e656ad7800 
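
To get a feel for how many of these replies show up and which objects
they hit, the log can be tallied with a few lines of Python. This is
only a sketch: the function name is hypothetical, and the regex is
based on the single example above and assumes each message sits on one
line in the actual log file.

```python
import re
from collections import Counter

# Captures the object name and the return code of an osd_op_reply, e.g.
#   osd_op_reply(286992957 data_log.20 [call] ... ondisk = -61 (...))
REPLY_RE = re.compile(
    r"osd_op_reply\(\d+\s+(\S+)\s+\[[^\]]*\].*?ondisk = (-?\d+)"
)


def count_failed_replies(lines):
    """Count replies with a negative return code, keyed by (object, code)."""
    failures = Counter()
    for line in lines:
        match = REPLY_RE.search(line)
        if match and int(match.group(2)) < 0:
            failures[(match.group(1), int(match.group(2)))] += 1
    return failures
```

Passing an open log file to `count_failed_replies` and printing
`most_common()` on the result shows which data_log.* objects fail most
often.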

When I restart the single instance, the load returns to normal for
exactly 20 minutes, after which it climbs back to roughly the same
level. This pointed me to the `rgw_sync_log_trim_interval` config,
which defaults to 1200 seconds. Both the log lines and this setting
seem to be related to multi-site replication logs.

Changing the option to another value also changes the delay until the
issue appears, which makes me pretty certain the issue is triggered by
one of the coroutines spawned by RGWSyncLogTrimThread.

Luckily, setting the interval to 0 keeps that thread from starting, so
I have applied that for now to prevent RGW from burning through my
flash cells and to keep power usage in check.
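
For anyone hitting the same thing, the workaround is a single
`ceph config set` call. Below is a small Python sketch that builds and
optionally runs it. The function names are hypothetical, and the
`client.rgw` config section is an assumption; your RGW daemons may be
configured under a different section name.

```python
import subprocess


def trim_interval_cmd(seconds, who="client.rgw"):
    """Build the argv that sets rgw_sync_log_trim_interval for `who`."""
    if seconds < 0:
        raise ValueError("interval must be >= 0")
    return [
        "ceph", "config", "set", who,
        "rgw_sync_log_trim_interval", str(seconds),
    ]


def apply_trim_interval(seconds, who="client.rgw"):
    """Run the command; needs a host with a working ceph CLI and keyring."""
    subprocess.run(trim_interval_cmd(seconds, who), check=True)
```

`apply_trim_interval(0)` corresponds to the setting described above.
The RGW daemons may need a restart before the change takes effect.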

I've also temporarily swapped out the RGW container image for
quay.ceph.io/ceph-ci/ceph:tentacle-debug@sha256:1b89786ce81c900227a2077be33dadda8722006b045b409e91e1ed89d7547be1
(I hope that was the right image to get the latest backports for
tentacle), but that version also has this issue.

I've also created a flamegraph with perf (I ran perf outside the
container and then converted the output to SVG inside the ceph
container), but it doesn't seem very useful to me since the interesting
symbols seem to be missing. I've attached it anyway.

Should I go ahead and report this to tracker.ceph.com or does anyone
have ideas on what to do here?

Thanks in advance for any input!

~ Phillip

[1]: https://ceph-storage.slack.com/archives/C1HFU4JK1/p1776029060801259
_______________________________________________
ceph-users mailing list -- [email protected]