On 09.12.2015 at 11:21, Jan Schermer wrote:
> Are you seeing "peering" PGs when the blocked requests are happening? That's 
> what we see regularly when starting OSDs.

Mostly "peering" and "activating".

> I'm not sure this can be solved completely (and whether there are major 
> improvements in newer Ceph versions), but it can be sped up by
> 1) making sure you have free (and not dirtied or fragmented) memory on the 
> node where you are starting the OSD
>       - that means dropping caches before starting the OSD if you have lots 
> of "free" RAM that is used for VFS cache
> 2) starting the OSDs one by one instead of booting several of them
> 3) if you pin the OSDs to CPUs/cores, do that after the OSD is in - I found 
> it to be best to pin the OSD to a cgroup limited to one NUMA node and then 
> limit it to a subset of cores after it has run a bit. OSD tends to use 
> hundreds of % of CPU when booting
> 4) you could possibly prewarm cache for the OSD in /var/lib/ceph/osd...
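
Just to make sure I read points 1 and 3 correctly, you mean something along
these lines? A rough sketch only; the cgroup path and name ("osd_3"), the
core list and the delay are made up for illustration, not what we run.

#!/usr/bin/env python
# Sketch of my reading of points 1 and 3: drop the VFS cache before starting
# an OSD, and only narrow its cpuset down after it has finished booting.
# The cgroup path/name, core list and delay below are illustrative only.
import os
import time

def drop_caches():
    os.system("sync")
    with open("/proc/sys/vm/drop_caches", "w") as f:
        f.write("3\n")              # free pagecache, dentries and inodes

def pin_osd(pid, cgroup="/sys/fs/cgroup/cpuset/osd_3", cores="0-3"):
    # move the OSD into a cpuset assumed to already span one NUMA node
    with open(os.path.join(cgroup, "tasks"), "w") as f:
        f.write("%d\n" % pid)
    time.sleep(120)                 # let it boot with the whole node available
    # afterwards restrict it to a few cores
    with open(os.path.join(cgroup, "cpuset.cpus"), "w") as f:
        f.write(cores + "\n")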

Thank you for your advice. The use case is not so much rebooting a whole
server as taking OSDs in/out for maintenance. During boot, we already start
the OSDs one after another with a 10-second pause between each pair.
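
In spirit, that boot-time sequence is nothing more than the following sketch.
The real integration differs; the sysvinit-style "service ceph start" call
and the hard-coded OSD ID list are only illustrative.

#!/usr/bin/env python
# Simplified sketch of our staggered OSD start at boot time.
import subprocess
import time

OSD_IDS = [0, 2, 3]      # illustrative; the real script discovers the IDs
PAUSE = 10               # the 10 s pause mentioned above

for osd_id in OSD_IDS:
    subprocess.check_call(["service", "ceph", "start", "osd.%d" % osd_id])
    time.sleep(PAUSE)    # give the OSD time to peer before starting the next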

I've done a bit of tracing: I kept a small cluster running with 2 of its 3
OSDs "in" and marked the third one "in" at 15:06:22. From ceph.log:

| 2015-12-09 15:06:22.827030 mon.0 172.20.4.6:6789/0 54964 : cluster [INF]
osdmap e264345: 3 osds: 3 up, 3 in
| 2015-12-09 15:06:22.828693 mon.0 172.20.4.6:6789/0 54965 : cluster [INF]
pgmap v39871295: 1800 pgs: 1800 active+clean; 439 GB data, 906 GB used, 4515
GB / 5421 GB avail; 6406 B/s rd, 889 kB/s wr, 67 op/s
| [...]
| 2015-12-09 15:06:29.163793 mon.0 172.20.4.6:6789/0 54972 : cluster [INF]
pgmap v39871299: 1800 pgs: 1800 active+clean; 439 GB data, 906 GB used, 7700
GB / 8607 GB avail

After a few seconds, backfills start as expected:

| 2015-12-09 15:06:24.853507 osd.3 172.20.4.40:6800/5072 778 : cluster [INF]
410.c9 restarting backfill on osd.2 from (0'0,0'0] MAX to 264336'502426
| [...]
| 2015-12-09 15:06:29.874092 osd.3 172.20.4.40:6800/5072 1308 : cluster [INF]
410.d1 restarting backfill on osd.2 from (0'0,0'0] MAX to 264344'1202983
| 2015-12-09 15:06:32.584907 mon.0 172.20.4.6:6789/0 54973 : cluster [INF]
pgmap v39871300: 1800 pgs: 3 active+remapped+wait_backfill, 191
active+remapped, 1169 active+clean, 437 activating+remapped; 439 GB data, 906
GB used, 7700 GB / 8607 GB avail; 1725 kB/s rd, 2486 kB/s wr, 605 op/s;
23058/278796 objects misplaced (8.271%); 56612 kB/s, 14 objects/s recovering
| 2015-12-09 15:06:24.851307 osd.0 172.20.4.51:6800/4919 2662 : cluster [INF]
410.c8 restarting backfill on osd.2 from (0'0,0'0] MAX to 264344'1017219
| 2015-12-09 15:06:38.555243 mon.0 172.20.4.6:6789/0 54976 : cluster [INF]
pgmap v39871303: 1800 pgs: 22 active+remapped+wait_backfill, 520
active+remapped, 638 active+clean, 620 activating+remapped; 439 GB data, 906
GB used, 7700 GB / 8607 GB avail; 45289 B/s wr, 4 op/s; 64014/313904 objects
misplaced (20.393%)
| 2015-12-09 15:06:38.133376 osd.3 172.20.4.40:6800/5072 1309 : cluster [WRN]
9 slow requests, 9 included below; oldest blocked for > 15.306541 secs
| 2015-12-09 15:06:38.133385 osd.3 172.20.4.40:6800/5072 1310 : cluster [WRN]
slow request 15.305213 seconds old, received at 2015-12-09 15:06:22.828061:
osd_op(client.15205073.0:35726 rbd_header.13998a74b0dc51 [watch reconnect
cookie 139897352489152 gen 37] 410.937870ca ondisk+write+known_if_redirected
e264345) currently reached_pg

It seems that PGs in "activating" state are causing blocked requests.
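
If anyone wants to correlate this in their own logs: a quick sketch that
pulls the "activating" PG counts and the slow request warnings out of
ceph.log. The regexes are only illustrative and are written against the
excerpts above.

#!/usr/bin/env python
# Rough sketch: correlate "activating" PG counts from pgmap lines in ceph.log
# with slow request warnings. Adjust the regexes for your own log format.
import re
import sys

pgmap_re = re.compile(r"pgmap v\d+: \d+ pgs: (.*?);")
slow_re = re.compile(r"(\d+) slow requests")

for line in open(sys.argv[1]):
    m = pgmap_re.search(line)
    if m:
        activating = sum(int(count) for count, state in
                         re.findall(r"(\d+) (\S+)", m.group(1))
                         if "activating" in state)
        print("%s activating PGs: %d" % (line[:19], activating))
        continue
    m = slow_re.search(line)
    if m:
        print("%s slow requests: %s" % (line[:19], m.group(1)))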

After half a minute or so, the slow requests disappear and backfill proceeds
normally:

| 2015-12-09 15:06:54.139948 osd.3 172.20.4.40:6800/5072 1396 : cluster [WRN]
42 slow requests, 9 included below; oldest blocked for > 31.188267 secs
| 2015-12-09 15:06:54.139957 osd.3 172.20.4.40:6800/5072 1397 : cluster [WRN]
slow request 15.566440 seconds old, received at 2015-12-09 15:06:38.573403:
osd_op(client.15165527.0:5878994 rbd_data.129a42ae8944a.0000000000000f2b
[set-alloc-hint object_size 4194304 write_size 4194304,write 1728512~4096]
410.de3ce70d snapc 3fd2=[3fd2] ack+ondisk+write+known_if_redirected e264348)
currently waiting for subops from 0,2
| 2015-12-09 15:06:54.139977 osd.3 172.20.4.40:6800/5072 1401 : cluster [WRN]
slow request 15.356852 seconds old, received at 2015-12-09 15:06:38.782990:
osd_op(client.15165527.0:5878997 rbd_data.129a42ae8944a.0000000000000f2b
[set-alloc-hint object_size 4194304 write_size 4194304,write 1880064~4096]
410.de3ce70d snapc 3fd2=[3fd2] ack+ondisk+write+known_if_redirected e264348)
currently waiting for subops from 0,2
| [...]
| 2015-12-09 15:07:00.072403 mon.0 172.20.4.6:6789/0 54989 : cluster [INF]
osdmap e264351: 3 osds: 3 up, 3 in
| 2015-12-09 15:07:00.074536 mon.0 172.20.4.6:6789/0 54990 : cluster [INF]
pgmap v39871313: 1800 pgs: 277 active+remapped+wait_backfill, 881
active+remapped, 4 active+remapped+backfilling, 638 active+clean; 439 GB data,
906 GB used, 7700 GB / 8607 GB avail; 347 kB/s rd, 2551 kB/s wr, 261 op/s;
162079/313904 objects misplaced (51.633%); 218 MB/s, 54 objects/s recovering

I've used Brendan Gregg's opensnoop utility to see what is going on at the
filesystem level (see the attached log). AFAICS the OSD reads lots of
directories. The underlying filesystem is XFS, so this should be sufficiently
fast. During the time I see slow requests, the OSD continuously opens
omap/*.ldb and omap/*.log files (starting at timestamp 95927.111837 in the
opensnoop log, which corresponds to 15:06:37 wall-clock time).
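
For anyone reading the attached log: the opensnoop timestamps are relative,
so mapping them to wall-clock time is just an offset from that one anchor
pair. A minimal sketch:

#!/usr/bin/env python
# Sketch: convert the relative timestamps in the attached opensnoop log to
# wall-clock time, using the anchor pair mentioned above
# (95927.111837 ~ 15:06:37). Purely illustrative.
import datetime

ANCHOR_TS = 95927.111837
ANCHOR_WALL = datetime.datetime(2015, 12, 9, 15, 6, 37)

def to_wallclock(ts):
    return ANCHOR_WALL + datetime.timedelta(seconds=ts - ANCHOR_TS)

print(to_wallclock(95927.111837))   # -> 2015-12-09 15:06:37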

Any idea how to at least reduce the blockage?

> It's unclear to me whether MONs influence this somehow (the peering stage) 
> but I have observed their CPU usage and IO also spikes when OSDs are started, 
> so make sure they are not under load.

I don't think this is an issue here. Our MONs don't use more than 5% CPU
during the operation and don't cause significant amounts of disk I/O.

Regards

Christian

-- 
Dipl-Inf. Christian Kauhaus <>< · k...@flyingcircus.io · +49 345 219401-0
Flying Circus Internet Operations GmbH · http://flyingcircus.io
Forsterstraße 29 · 06112 Halle (Saale) · Deutschland
HR Stendal 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick

Attachment: osd-opensnoop.log.gz
