Re: [ceph-users] Random OSDs respawning continuously

2015-02-13 Thread Mohamed Pakkeer
Hi all,

  When I stop a respawning OSD on an OSD node, another OSD on the same node
starts respawning. When an OSD starts respawning, it writes the following
message to its log:

slow request 31.129671 seconds old, received at 2015-02-13 19:09:32.180496:
osd_op(*osd.551*.95229:11 191 10005c4.0033 [copy-get max 8388608]
13.f4ccd256 RETRY=50
ack+retry+read+ignore_cache+ignore_overlay+map_snap_clone+known_if_redirected
e95518) currently reached_pg

osd.551 is part of the cache tier. All of the respawning OSDs have similar
log entries, each referring to a different cache-tier OSD. If I restart all
the OSDs on the cache-tier OSD node, the respawning stops and the cluster
returns to the active+clean state. But as soon as I write some data to the
cluster, a random OSD starts respawning again.

Can anyone help me solve this issue?
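
For reference, blocked requests like the ones below can be inspected with the
standard Ceph tooling. This is only a sketch: osd.551 is just the id taken
from the log above, and the PG id has to come from whatever "ceph health
detail" actually reports on your cluster.

# List the blocked/slow requests and the PGs and OSDs they involve
ceph health detail

# On the host carrying osd.551, dump the ops it currently has in flight
# and the most recent slow ops it completed
ceph daemon osd.551 dump_ops_in_flight
ceph daemon osd.551 dump_historic_ops

# Query one of the PGs named in the slow-request lines
# (replace <pgid> with a PG id reported by "ceph health detail")
ceph pg <pgid> query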


  2015-02-13 19:10:02.309848 7f53eef54700  0 log_channel(default) log [WRN]
: 11 slow requests, 11 included below; oldest blocked for  30.132629 secs
2015-02-13 19:10:02.309854 7f53eef54700  0 log_channel(default) log [WRN] :
slow request 30.132629 seconds old, received at 2015-02-13 19:09:32.177075:
osd_op(osd.551.95229:63
 10002ae. [copy-from ver 7622] 13.7273b256 RETRY=130 snapc 1=[]
ondisk+retry+write+ignore_overlay+enforce_snapc+known_if_redirected e95518)
currently reached_pg
2015-02-13 19:10:02.309858 7f53eef54700  0 log_channel(default) log [WRN] :
slow request 30.131608 seconds old, received at 2015-02-13 19:09:32.178096:
osd_op(osd.551.95229:41
5 10003a0.0006 [copy-get max 8388608] 13.aefb256 RETRY=118
ack+retry+read+ignore_cache+ignore_overlay+map_snap_clone+known_if_redirected
e95518) currently reached_pg
2015-02-13 19:10:02.309861 7f53eef54700  0 log_channel(default) log [WRN] :
slow request 30.130994 seconds old, received at 2015-02-13 19:09:32.178710:
osd_op(osd.551.95229:26
83 100029d.003b [copy-get max 8388608] 13.a2be1256 RETRY=115
ack+retry+read+ignore_cache+ignore_overlay+map_snap_clone+known_if_redirected
e95518) currently reached_pg
2015-02-13 19:10:02.309864 7f53eef54700  0 log_channel(default) log [WRN] :
slow request 30.130426 seconds old, received at 2015-02-13 19:09:32.179278:
osd_op(osd.551.95229:39
39 10004e9.0032 [copy-get max 8388608] 13.6a25b256 RETRY=105
ack+retry+read+ignore_cache+ignore_overlay+map_snap_clone+known_if_redirected
e95518) currently reached_pg
2015-02-13 19:10:02.309868 7f53eef54700  0 log_channel(default) log [WRN] :
slow request 30.129697 seconds old, received at 2015-02-13 19:09:32.180007:
osd_op(osd.551.95229:97
49 1000553.007e [copy-get max 8388608] 13.c8645256 RETRY=59
ack+retry+read+ignore_cache+ignore_overlay+map_snap_clone+known_if_redirected
e95518) currently reached_pg
2015-02-13 19:10:03.310284 7f53eef54700  0 log_channel(default) log [WRN] :
11 slow requests, 6 included below; oldest blocked for  31.133092 secs
2015-02-13 19:10:03.310305 7f53eef54700  0 log_channel(default) log [WRN] :
slow request 31.129671 seconds old, received at 2015-02-13 19:09:32.180496:
osd_op(osd.551.95229:11
191 10005c4.0033 [copy-get max 8388608] 13.f4ccd256 RETRY=50
ack+retry+read+ignore_cache+ignore_overlay+map_snap_clone+known_if_redirected
e95518) currently reached_pg
2015-02-13 19:10:03.310308 7f53eef54700  0 log_channel(default) log [WRN] :
slow request 31.128616 seconds old, received at 2015-02-13 19:09:32.181551:
osd_op(osd.551.95229:12
903 10002e4.00d6 [copy-get max 8388608] 13.f56a3256 RETRY=41
ack+retry+read+ignore_cache+ignore_overlay+map_snap_clone+known_if_redirected
e95518) currently reached_pg
2015-02-13 19:10:03.310322 7f53eef54700  0 log_channel(default) log [WRN] :
slow request 31.127807 seconds old, received at 2015-02-13 19:09:32.182360:
osd_op(osd.551.95229:14
165 1000480.0110 [copy-get max 8388608] 13.fd8c1256 RETRY=32
ack+retry+read+ignore_cache+ignore_overlay+map_snap_clone+known_if_redirected
e95518) currently reached_pg
2015-02-13 19:10:03.310327 7f53eef54700  0 log_channel(default) log [WRN] :
slow request 31.127320 seconds old, received at 2015-02-13 19:09:32.182847:
osd_op(osd.551.95229:15
013 100047f.0133 [copy-get max 8388608] 13.b7b05256 RETRY=27
ack+retry+read+ignore_cache+ignore_overlay+map_snap_clone+known_if_redirected
e95518) currently reached_pg
2015-02-13 19:10:03.310331 7f53eef54700  0 log_channel(default) log [WRN] :
slow request 31.126935 seconds old, received at 2015-02-13 19:09:32.183232:
osd_op(osd.551.95229:15
767 100066d.001e [copy-get max 8388608] 13.3b017256 RETRY=25
ack+retry+read+ignore_cache+ignore_overlay+map_snap_clone+known_if_redirected
e95518) currently reached_pg
2015-02-13 19:10:04.310685 7f53eef54700  0 log_channel(default) log [WRN] :
11 slow requests, 1 included below; oldest blocked for  32.133566 secs
2015-02-13 19:10:04.310705 7f53eef54700  0 log_channel(default) log [WRN] :
slow request 32.126584 seconds old, received at 2015-02-13 19:09:32.184057:
osd_op(osd.551.95229:16
293 1000601.0029 [copy-get max 8388608] 

Re: [ceph-users] Random OSDs respawning continuously

2015-02-13 Thread Gregory Farnum
It's not entirely clear, but it looks like all the ops are just your
caching-pool OSDs trying to promote objects, and your backing-pool OSDs
aren't fast enough to satisfy all the IO demanded of them. You may be
overloading the system.
-Greg
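
If the problem is promotion pressure on the cache tier, one common set of
knobs is to bound the cache pool and make the tiering agent flush and evict
earlier. This is only a sketch: "cachepool" stands for the actual cache-pool
name, and the values are illustrative, not recommendations from this thread.

# Cap the cache pool so the tiering agent starts flushing/evicting
# before the cache-tier OSDs are saturated
ceph osd pool set cachepool target_max_bytes 1099511627776   # e.g. 1 TiB
ceph osd pool set cachepool cache_target_dirty_ratio 0.4
ceph osd pool set cachepool cache_target_full_ratio 0.8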

[ceph-users] Random OSDs respawning continuously

2015-02-12 Thread Mohamed Pakkeer
Hi all,

Cluster : 540 OSDs , Cache tier and EC pool
ceph version 0.87


cluster c2a97a2f-fdc7-4eb5-82ef-70c52f2eceb1
 health HEALTH_WARN 10 pgs peering; 21 pgs stale; 2 pgs stuck inactive;
2 pgs stuck unclean; 287 requests are blocked > 32 sec; recovery 24/6707031
objects degraded (0.000%); too few pgs per osd (13 < min 20); 1/552 in osds
are down; clock skew detected on mon.master02, mon.master03
 monmap e3: 3 mons at {master01=
10.1.2.231:6789/0,master02=10.1.2.232:6789/0,master03=10.1.2.233:6789/0},
election epoch 4, quorum 0,1,2 master01,master02,master03
 mdsmap e17: 1/1/1 up {0=master01=up:active}
 osdmap e57805: 552 osds: 551 up, 552 in
  pgmap v278604: 7264 pgs, 3 pools, 2027 GB data, 547 kobjects
3811 GB used, 1958 TB / 1962 TB avail
24/6707031 objects degraded (0.000%)
   7 stale+peering
   3 peering
7240 active+clean
  13 stale
   1 stale+active




We have mounted CephFS using the ceph-fuse client. Suddenly some of the OSDs
are respawning continuously, and the cluster health is still unstable. How can
we stop the respawning OSDs?
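
While investigating, one stop-gap is to keep the cluster from starting
recovery and to stop a flapping daemon cleanly. A sketch only, assuming the
Ubuntu/Upstart packaging that ceph 0.87 normally ships with; osd.538 is the
id from the log below.

ceph osd set noout            # do not start recovery for OSDs that are down
stop ceph-osd id=538          # stop the respawning daemon on its host (Upstart)
# ... investigate, then bring it back and clear the flag
start ceph-osd id=538
ceph osd unset noout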



2015-02-12 18:41:51.562337 7f8371373900  0 ceph version 0.87
(c51c8f9d80fa4e0168aa52685b8de40e42758578), process ceph-osd, pid 3911
2015-02-12 18:41:51.564781 7f8371373900  0
filestore(/var/lib/ceph/osd/ceph-538) backend xfs (magic 0x58465342)
2015-02-12 18:41:51.564792 7f8371373900  1
filestore(/var/lib/ceph/osd/ceph-538)  disabling 'filestore replica
fadvise' due to known issues with fadvise(DONTNEED) on xfs
2015-02-12 18:41:51.655623 7f8371373900  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-538) detect_features: FIEMAP
ioctl is supported and appears to work
2015-02-12 18:41:51.655639 7f8371373900  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-538) detect_features: FIEMAP
ioctl is disabled via 'filestore fiemap' config option
2015-02-12 18:41:51.663864 7f8371373900  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-538) detect_features:
syncfs(2) syscall fully supported (by glibc and kernel)
2015-02-12 18:41:51.663910 7f8371373900  0
xfsfilestorebackend(/var/lib/ceph/osd/ceph-538) detect_feature: extsize is
disabled by conf
2015-02-12 18:41:51.994021 7f8371373900  0
filestore(/var/lib/ceph/osd/ceph-538) mount: enabling WRITEAHEAD journal
mode: checkpoint is not enabled
2015-02-12 18:41:52.788178 7f8371373900  1 journal _open
/var/lib/ceph/osd/ceph-538/journal fd 20: 5367660544 bytes, block size 4096
bytes, directio = 1, aio = 1
2015-02-12 18:41:52.848430 7f8371373900  1 journal _open
/var/lib/ceph/osd/ceph-538/journal fd 20: 5367660544 bytes, block size 4096
bytes, directio = 1, aio = 1
2015-02-12 18:41:52.922806 7f8371373900  1 journal close
/var/lib/ceph/osd/ceph-538/journal
2015-02-12 18:41:52.948320 7f8371373900  0
filestore(/var/lib/ceph/osd/ceph-538) backend xfs (magic 0x58465342)
2015-02-12 18:41:52.981122 7f8371373900  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-538) detect_features: FIEMAP
ioctl is supported and appears to work
2015-02-12 18:41:52.981137 7f8371373900  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-538) detect_features: FIEMAP
ioctl is disabled via 'filestore fiemap' config option
2015-02-12 18:41:52.989395 7f8371373900  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-538) detect_features:
syncfs(2) syscall fully supported (by glibc and kernel)
2015-02-12 18:41:52.989440 7f8371373900  0
xfsfilestorebackend(/var/lib/ceph/osd/ceph-538) detect_feature: extsize is
disabled by conf
2015-02-12 18:41:53.149095 7f8371373900  0
filestore(/var/lib/ceph/osd/ceph-538) mount: WRITEAHEAD journal mode
explicitly enabled in conf
2015-02-12 18:41:53.154258 7f8371373900  1 journal _open
/var/lib/ceph/osd/ceph-538/journal fd 20: 5367660544 bytes, block size 4096
bytes, directio = 1, aio = 1
2015-02-12 18:41:53.217404 7f8371373900  1 journal _open
/var/lib/ceph/osd/ceph-538/journal fd 20: 5367660544 bytes, block size 4096
bytes, directio = 1, aio = 1
2015-02-12 18:41:53.467512 7f8371373900  0 cls
cls/hello/cls_hello.cc:271: loading cls_hello
2015-02-12 18:41:53.563846 7f8371373900  0 osd.538 54486 crush map has
features 104186773504, adjusting msgr requires for clients
2015-02-12 18:41:53.563865 7f8371373900  0 osd.538 54486 crush map has
features 379064680448 was 8705, adjusting msgr requires for mons
2015-02-12 18:41:53.563869 7f8371373900  0 osd.538 54486 crush map has
features 379064680448, adjusting msgr requires for osds
2015-02-12 18:41:53.563888 7f8371373900  0 osd.538 54486 load_pgs
2015-02-12 18:41:55.430730 7f8371373900  0 osd.538 54486 load_pgs opened
137 pgs
2015-02-12 18:41:55.432854 7f8371373900 -1 osd.538 54486
set_disk_tp_priority(22) Invalid argument: osd_disk_thread_ioprio_class is
 but only the following values are allowed:
 idle, be or rt
2015-02-12 18:41:55.442748 7f835dfc8700  0 osd.538 54486 ignoring osdmap
until we have initialized
2015-02-12 18:41:55.456802 7f835dfc8700  0 osd.538 54486 ignoring osdmap
until we have
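
One incidental item in the startup log above: the set_disk_tp_priority error
means osd_disk_thread_ioprio_class has an empty or invalid value. It is almost
certainly not what makes the OSDs respawn, but it can be silenced by setting
one of the allowed values, for example (a sketch; as far as I know the ioprio
settings only take effect with the CFQ disk scheduler):

ceph tell osd.538 injectargs '--osd_disk_thread_ioprio_class idle'
# to make it permanent, set "osd disk thread ioprio class = idle" in ceph.conf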