Hi,
we are operating a Ceph / OpenStack cluster at ScaleUp and see slow requests
on one PG, but only between 0:00 and 2:00 UTC every day. The rest of the time
the cluster operates without any issues.
I'm new to Ceph and this is my first post to this ML, so please be kind.
ceph pg map 5.40
osdmap e30892 pg 5.40 (5.40) -> up [29,20,32] acting [29,20,32]
ceph --version
ceph version 14.2.22 (ca74598065096e6fcbd8433c8779a2be0c889351) nautilus
(stable)
We created a script which calls:
/usr/bin/ceph daemon osd.29 dump_historic_ops >> /home/osd29-ops
and
/usr/bin/ceph daemon osd.20 dump_historic_ops >> /home/osd20-ops
every 20 seconds.
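In case it helps, the loop looks roughly like this (just a minimal sketch; the
OSD ids and output paths are the ones from above):
-- snip --
#!/bin/bash
# Append the historic ops of the two involved OSDs to files every 20 seconds
while true; do
    /usr/bin/ceph daemon osd.29 dump_historic_ops >> /home/osd29-ops
    /usr/bin/ceph daemon osd.20 dump_historic_ops >> /home/osd20-ops
    sleep 20
done
-- snap --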
Here is one example:
-- snip --
"description": "osd_op(client.3144268599.0:7872481 5.40
5:03b6209d:::rbd_data.7577bc66334873.000000000000000e:head
[stat,write 1064960~4096] snapc ba=[] ondisk+write+known_if_redirected
e30892)",
"initiated_at": "2021-07-14 22:00:14.196286",
"age": 27.823148335999999,
"duration": 18.782248133,
"type_data": {
"flag_point": "commit sent; apply or cleanup",
"client_info": {
"client": "client.3144268599",
"client_addr": "v1:10.23.181.48:0/768948222",
"tid": 7872481
},
"events": [
{
"time": "2021-07-14 22:00:14.196286",
"event": "initiated"
},
{
"time": "2021-07-14 22:00:14.196286",
"event": "header_read"
},
{
"time": "2021-07-14 22:00:14.196287",
"event": "throttled"
},
{
"time": "2021-07-14 22:00:14.196302",
"event": "all_read"
},
{
"time": "2021-07-14 22:00:14.196303",
"event": "dispatched"
},
{
"time": "2021-07-14 22:00:14.196305",
"event": "queued_for_pg"
},
{
"time": "2021-07-14 22:00:14.196460",
"event": "reached_pg"
},
{
"time": "2021-07-14 22:00:14.196470",
"event": "waiting for rw locks"
},
{
"time": "2021-07-14 22:00:17.999868",
"event": "reached_pg"
},
...
(more "reached_pg" and "waiting for rw locks" events)
...
"time": "2021-07-14 22:00:32.662470",
"event": "waiting for rw locks"
},
{
"time": "2021-07-14 22:00:32.885560",
"event": "reached_pg"
},
{
"time": "2021-07-14 22:00:32.885588",
"event": "started"
},
{
"time": "2021-07-14 22:00:32.885661",
"event": "waiting for subops from 20,32"
},
{
"time": "2021-07-14 22:00:32.886279",
"event": "op_commit"
},
{
"time": "2021-07-14 22:00:32.973286",
"event": "sub_op_commit_rec"
},
{
"time": "2021-07-14 22:00:32.978367",
"event": "sub_op_commit_rec"
},
{
"time": "2021-07-14 22:00:32.978411",
"event": "commit_sent"
},
{
"time": "2021-07-14 22:00:32.978534",
"event": "done"
}
]
}
},
-- snap --
grep "waiting for subops" /home/osd29-ops |grep 20 |sort |uniq -c
930 "event": "waiting for subops from 20,30"
7355 "event": "waiting for subops from 20,32"
4862 "event": "waiting for subops from 20,58"
I found a similar question on the ceph mailing list from 2019,
https://lists.ceph.io/hyperkitty/list/[email protected]/thread/E6L3MSTC74S5HUZVQ7XG4CFPKVNJDTQI/
which was answered by Wesley Peng:
-- snip --
There are too many logs for "waiting for rw locks" that indicates the
system is busy. Maybe you want to scaling more OSDs to improve the
performance.
-- snap --
Is there a way to find out which part of the system is too busy? Is it disk
I/O?
And we wonder why it is always only this one PG and no other OSD, and how to
deal with this. If we add another OSD, the PG will perhaps move to another
OSD, but it will still exist and may still have this issue.
Any help appreciated
Best Regards
Sven
--
=================================================
ScaleUp Technologies GmbH & Co. KG
Suederstrasse 198
20537 Hamburg
Germany
Tel.: +49 40 59380500
Fax: +49 40 59380260
Registered Office: Hamburg
Commercial Register Hamburg, HRA 90445
General Partner: ScaleUp Management GmbH
Registered Office: Hamburg
Commercial Register Hamburg, HRB 91902
Directors: Christoph Streit, Gihan Behrmann
www.scaleuptech.com
=================================================