Hi Casey, Igor,
Thank you for your responses.
Casey:
This cluster is the secondary zone in a multi-site setup (the placement
pools have been set to "default").
[root@mon-1 ~]# radosgw-admin zone get
{
    "id": "96b9adcb-fdc0-4f36-a310-89347b74ea67",
    "name": "secondary_zone",
    "placement_pools": [
        {
            "key": "default-placement",
            "val": {
                "index_pool": "default.rgw.buckets.index",
...
Igor:
All of the OSDs are backed by BlueStore. There are NVMe disks dedicated to
the index pool. I don't use separate DB and WAL devices.
Creation command: ceph-volume --cluster ceph lvm create --bluestore --data
/dev/nvme0n1p9
These OSDs were deployed on Mimic (v13.2.5).
Last Friday I upgraded the cluster to 14.2.6 and purged and recreated this
OSD, and I don't see the problem anymore.
I hope this helps.
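(For reference, purging and recreating a single BlueStore OSD typically goes
roughly like this; osd.62 and /dev/nvme0n1p9 are just this thread's example
values and the exact sequence is only a sketch:)
ceph osd out 62
systemctl stop ceph-osd@62          # on the OSD host
ceph osd purge 62 --yes-i-really-mean-it
ceph-volume lvm zap /dev/nvme0n1p9 --destroy
ceph-volume --cluster ceph lvm create --bluestore --data /dev/nvme0n1p9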
Below I paste the results of debugging the problem:
1) listing omap keys
[root@mon-1 ~]# time rados -p default.rgw.buckets.index listomapkeys
.dir.99a07ed8-2112-429b-9f94-81383220a95b.33721775.254244.31 > omapkeys.list
real 0m29.475s
user 0m0.123s
sys 0m0.131s
[root@osd-n57 ~]# tail -f /var/log/ceph/ceph-osd.62.log | grep "slow operation "
2020-01-10 09:50:36.274 7f1fa22dd700 0
bluestore(/var/lib/ceph/osd/ceph-62) log_latency slow operation observed
for next, latency = 29.0901s
2) listing omap vals
[dc-2 root@mon-1 tmp]# time rados -p default.rgw.buckets.index listomapvals
.dir.99a07ed8-2112-429b-9f94-81383220a95b.33721775.254244.31 > omapvals.list
real 0m29.080s
user 0m2.801s
sys 0m0.270s
[root@osd-n57 ~]# tail -f /var/log/ceph/ceph-osd.62.log | grep "slow operation "
2020-01-10 09:52:37.164 7f1fa22dd700 0
bluestore(/var/lib/ceph/osd/ceph-62) log_latency slow operation observed
for next, latency = 25.6445s
3) listing bilogs for this bucket (ID:
99a07ed8-2112-429b-9f94-81383220a95b.33721775.254244)
[root@mon-1 ~]# radosgw-admin bilog list --bucket bucket_1  # (the same effect)
[root@osd-n57 ~]# tail -f /var/log/ceph/ceph-osd.62.log | grep "slow operation "
2020-01-10 10:01:27.896 7f1fa62e5700 0
bluestore(/var/lib/ceph/osd/ceph-62) log_latency_fn slow operation observed
for upper_bound, latency = 26.7958s, after = �0_ omap_iterator(cid =
230.2c_head, oid =
#230:36f53ff9:::.dir.99a07ed8-2112-429b-9f94-81383220a95b.33721775.254244.31:head#)
2020-01-10 10:01:28.981 7f1fbfb18700 -1 osd.62 23457 get_health_metrics
reporting 2 slow ops, oldest is osd_op(client.200549558.0:308265358 230.2c
230:35a96e8b:::.dir.99a07ed8-2112-429b-9f94-81383220a95b.49439529.1.85:head
[call rgw.bi_log_trim] snapc 0=[] ondisk+write+known_if_redirected e23457)
4) bilog GC, while trimming the bucket index logs of this bucket:
.dir.99a07ed8-2112-429b-9f94-81383220a95b.33721775.254244.31 (the same
effect as above)
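(The corresponding manual trim, for the same bucket as in step 3, is
roughly:)
[root@mon-1 ~]# radosgw-admin bilog trim --bucket bucket_1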
I checked that a lot of operations in the cluster are blocked while some
other process is touching the omap of shard
.dir.99a07ed8-2112-429b-9f94-81383220a95b.33721775.254244.31.
For example, an s3cmd listing:
First scenario: list objects from a random bucket using s3cmd.
[root@po:~ ]$ time s3cmd -c .s3cfg.n123 --no-ssl ls s3://bucket_n123/
--limit 1 -r
2019-09-10 10:38 511600 s3://bucket_n123/BnxKLbY2mN7Jw5b4pdkiU.jpg
real 0m0,244s
user 0m0,142s
sys 0m0,041s
Second scenario: list objects from a random bucket using s3cmd while
listing the omap keys of
.dir.99a07ed8-2112-429b-9f94-81383220a95b.33721775.254244.31 at the same
time:
[root@mon-1 ~]# time rados -p default.rgw.buckets.index listomapkeys
.dir.99a07ed8-2112-429b-9f94-81383220a95b.33721775.254244.31
[root@po:~ ]$ time s3cmd -c .s3cfg.n123 --no-ssl ls s3://bucket_n123/
--limit 1 -r
2019-09-10 10:38 511600 s3://bucket_n123/BnxKLbY2mN7Jw5b4pdkiU.jpg
real 0m25,802s
user 0m0,211s
sys 0m0,040s
[root@osd-n57 ~]# tail -f /var/log/ceph/ceph-osd.62.log | grep "slow operation "
2020-01-10 10:08:40.942 7f1fbfb18700 -1 osd.62 23480 get_health_metrics
reporting 60 slow ops, oldest is osd_op(client.200549558.0:313502542 230.2c
230:34aaa3d7:::.dir.99a07ed8-2112-429b-9f94-81383220a95b.51788790.1.2:head
[call rgw.guard_bucket_resharding,call rgw.bucket_prepare_op] snapc 0=[]
ondisk+write+known_if_redirected e23480)
Thanks,
Piotr
On Fri, 10 Jan 2020 at 15:30, Igor Fedotov <[email protected]> wrote:
> Hi Piotr,
>
> I'm wondering what the device layout behind the OSD/BlueStore is. Do you
> have a fast (SSD or NVMe) standalone device for DB/WAL at the slow OSDs?
> Were the OSDs in question deployed on Nautilus, or do they originate from
> earlier Ceph releases?
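> (One way to check, e.g. for osd.62, is the OSD metadata, which reports the
> backing devices and whether a dedicated DB/WAL device is configured; the
> exact keys vary a bit between releases:)
>
> ceph osd metadata 62 | grep -E 'bluefs|rotational|osd_objectstore'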
>
> We've got multiple reports about slow omap operations on different
> clusters from our customers. In every case there was rotational-only (!)
> media behind the OSDs and an RGW payload, i.e. a slow DB device and a high
> amount of metadata going into RocksDB. The behavior was most likely
> triggered by a massive prior data removal. It looks like RocksDB behind
> BlueStore has some performance gaps when dealing with a highly fragmented
> DB on slow media. Nautilus has some improvements in this respect, but
> chances are this isn't enough.
>
> Manual RocksDB compaction seems to be a short-term workaround if the above
> is applicable. The general advice would be to have a faster device for the
> DB.
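> (For illustration, compaction can be triggered per OSD through its admin
> socket, or offline with the OSD stopped; osd.62 is just this thread's
> example, and availability of the online command depends on the release:)
>
> ceph daemon osd.62 compact        # on the OSD host, via the admin socket
> # or, with the OSD down:
> ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-62 compact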
>
>
> Thanks,
>
> Igor
>
> PS. The above doesn't mean that RGW is free of related issues and/or is
> absolutely optimal. Most probably it has some areas for improvement which
> might resolve the issue as well, which would likely be an alternative
> approach to a fix. But I'm not aware of them.
>
> PS 2. Potentially the same RocksDB slowdown might also occur if an OSD has
> a large DB data spillover from the fast DB device to the slow main one.
> Nautilus has corresponding alerts available via the ceph -s / ceph health
> commands.
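> (Spillover, if present, is reported as a BLUEFS_SPILLOVER health warning,
> so a quick check is e.g.:)
>
> ceph health detail | grep -i spillover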
>
>
> On 1/9/2020 6:33 PM, P. O. wrote:
>
> Hi All,
>
> I am observing strange behaviour on one OSD, which is dedicated to the RGW
> index (pool: default.rgw.buckets.index).
> My ceph is: ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba)
> nautilus (stable)
>
> In the log files there are many messages like the ones below:
> 2020-01-08 10:37:53.054 7f1fa22dd700 0
> bluestore(/var/lib/ceph/osd/ceph-62) log_latency slow operation observed
> for next, latency = 22.4641s
> ...
> 2020-01-08 14:57:06.987 7f1fa22dd700 0
> bluestore(/var/lib/ceph/osd/ceph-62) log_latency_fn slow operation observed
> for upper_bound, latency = 28.9336s, after = �0_ omap_iterator(cid =
> 230.2c_head, oid =
> #230:36f53ff9:::.dir.12307ed8-2342-4293-9f14-12234220a234.13221775.224244.31:head#)
> 2020-01-08 15:17:39.972 7f1fa22dd700 0
> bluestore(/var/lib/ceph/osd/ceph-62) log_latency_fn slow operation observed
> for upper_bound, latency = 23.9168s, after = �0_ omap_iterator(cid =
> 230.2c_head, oid =
> #230:36f53ff9:::.dir.12307ed8-2342-4293-9f14-12234220a234.13221775.224244.31:head#)
> 2020-01-08 15:37:44.221 7f1fa62e5700 0
> bluestore(/var/lib/ceph/osd/ceph-62) log_latency_fn slow operation observed
> for upper_bound, latency = 25.8316s, after = �0_ omap_iterator(cid =
> 230.2c_head, oid =
> #230:36f53ff9:::.dir.12307ed8-2342-4293-9f14-12234220a234.13221775.224244.31:head#)
> 2020-01-08 15:58:12.279 7f1fa62e5700 0
> bluestore(/var/lib/ceph/osd/ceph-62) log_latency_fn slow operation observed
> for upper_bound, latency = 27.8207s, after = �0_ omap_iterator(cid =
> 230.2c_head, oid =
> #230:36f53ff9:::.dir.12307ed8-2342-4293-9f14-12234220a234.13221775.224244.31:head#)
>
> I discovered that these latencies are associated with the bilog GC (it
> runs every 30 minutes).
>
> # /etc/ceph/ceph.conf
> rgw sync log trim interval = 1200
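> (The value in effect can be checked at runtime through the RGW admin
> socket; the daemon name client.rgw.<host> below is only a placeholder:)
>
> ceph daemon client.rgw.<host> config get rgw_sync_log_trim_interval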
>
>
> ID of bucket bucket_1: 12307ed8-2342-4293-9f14-12234220a234.13221775.224244
> Shard no. 31 of the bucket_1 index:
> .dir.12307ed8-2342-4293-9f14-12234220a234.13221775.224244.31
>
> [root@mon-1 ~]# ceph osd map default.rgw.buckets.index
> .dir.12307ed8-2342-4293-9f14-12234220a234.13221775.224244.31
> osdmap e23417 pool 'default.rgw.buckets.index' (230) object
> '.dir.12307ed8-2342-4293-9f14-12234220a234.13221775.224244.31' -> pg
> 230.9ffcaf6c (230.2c) -> up ([62,56,60], p62) acting ([62,56,60], p62)
>
> OSD: 62,56,60
> Problematic OSD: 62
> Problematic PG: 230.2c
>
>
> Digging deeper...
>
> I checked that when I run the command:
> [root@mon-1 ~]# radosgw-admin bilog list --bucket bucket_1
> []
>
> Even when there are no bilogs, there are the same symptoms:
>
> 1) In logs:
> 2020-01-08 16:31:24.728 7f1fa22dd700 0
> bluestore(/var/lib/ceph/osd/ceph-62) log_latency_fn slow operation observed
> for upper_bound, latency = 28.653s, after = �0_ omap_iterator(cid =
> 230.2c_head, oid =
> #230:36f53ff9:::.dir.12307ed8-2342-4293-9f14-12234220a234.13221775.224244.31:head#)
>
> 2) Any other radosgw-admin command is very slow while another process is
> iterating through the bilogs of this bucket.
>
> [dc-2 root@mon-1 ~]# time radosgw-admin bucket stats --bucket=bucket234
> ...
> real 0m25.641s
> user 0m0.148s
> sys 0m0.075s
>
> When I try to run a deep scrub on this PG (230.2c), the following appears
> in the logs:
>
> [root@mon-1 ~]# ceph pg deep-scrub 230.2c
>
> 2020-01-08 16:38:32.551 7f1fc50f0700 1 heartbeat_map is_healthy
> 'OSD::osd_op_tp thread 0x7f1fa22dd700' had timed out after 15
> ...
> 2020-01-08 16:38:46.735 7f1fc60f2700 1 heartbeat_map is_healthy
> 'OSD::osd_op_tp thread 0x7f1fa62e5700' had timed out after 15
> 2020-01-08 16:38:46.735 7f1fc60f2700 1 heartbeat_map is_healthy
> 'OSD::osd_op_tp thread 0x7f1fa22dd700' had timed out after 15
> 2020-01-08 16:38:46.735 7f1fc60f2700 1 heartbeat_map is_healthy
> 'OSD::osd_op_tp thread 0x7f1fa62e5700' had timed out after 15
> 2020-01-08 16:38:46.911 7f1fa22dd700 0
> bluestore(/var/lib/ceph/osd/ceph-62) log_latency_fn slow operation observed
> for upper_bound, latency = 29.8655s, after = <80>0_ omap_iterator(cid =
> 230.2c_head, oid =
> #230:36f53ff9:::.dir.12307ed8-2342-4293-9f14-12234220a234.13221775.224244.31:head#)
> 2020-01-08 16:38:46.911 7f1fa22dd700 1 heartbeat_map reset_timeout
> 'OSD::osd_op_tp thread 0x7f1fa22dd700' had timed out after 15
> 2020-01-08 16:38:46.911 7f1fa62e5700 1 heartbeat_map reset_timeout
> 'OSD::osd_op_tp thread 0x7f1fa62e5700' had timed out after 15
>
>
> When I try to iterate with listomapkeys over
> .dir.12307ed8-2342-4293-9f14-12234220a234.13221775.224244.31:
> [root@mon-1 ~]# rados -p default.rgw.buckets.index listomapkeys
> .dir.12307ed8-2342-4293-9f14-12234220a234.13221775.224244.31
> In the logs I observe:
>
> 2020-01-09 11:08:58.847 7f1fa62e5700 0
> bluestore(/var/lib/ceph/osd/ceph-62) log_latency slow operation observed
> for next, latency = 24.1949s
>
> The same happens for listomapvals:
>
> [root@mon-1 ~]# rados -p default.rgw.buckets.index listomapvals
> .dir.12307ed8-2342-4293-9f14-12234220a234.13221775.224244.31
>
> 2020-01-09 11:11:36.508 7f1fa62e5700 0
> bluestore(/var/lib/ceph/osd/ceph-62) log_latency slow operation observed
> for next, latency = 28.2137s
>
>
> I tried to purge and recreate this OSD, and I also trimmed the bilogs for
> this bucket (radosgw-admin bilog trim --bucket bucket_1), but this didn't
> help.
>
> I suspect that any operation on PG 230.2c on OSD 62 can cause slow
> operations in my cluster.
> I have no idea how to debug/resolve this problem. Any help/hint is
> appreciated!
>
>
> Thanks,
> Piotr
>
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]