On Mon, Feb 23, 2015 at 8:59 AM, Chris Murray <chrismurra...@gmail.com> wrote:
> ... Trying to send again after reporting bounce backs to dreamhost ...
> ... Trying to send one more time after seeing mails come through the
> list today ...
>
> Hi all,
>
> First off, I should point out that this is a 'small cluster' issue and
> may well be due to the stretched resources. If I'm doomed to destroying
> and starting again, so be it, but I'm interested to see if things can
> get up and running again.
>
> My experimental ceph cluster now has 5 nodes with 3 osds each. Some
> drives are big, some drives are small. Most are formatted with BTRFS and
> two are still formatted with XFS, which I intend to remove and recreate
> with BTRFS at some point. I gather BTRFS isn't entirely stable yet, but
> compression suits my use-case, so I'm prepared to stick with it while it
> matures. I had to set the following, to avoid osds dying as the IO was
> consumed by the snapshot creation and deletion process (as I understand
> it):
>
>     filestore btrfs snap = false
>
> and the mount options look like this:
>
>     osd mount options btrfs =
> rw,noatime,space_cache,user_subvol_rm_allowed,compress-force=lzo
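>
> (For reference, a minimal sketch of how these two settings might sit
> together in the [osd] section of ceph.conf, assuming they apply
> cluster-wide rather than per-OSD:)
>
>     [osd]
>         filestore btrfs snap = false
>         osd mount options btrfs = rw,noatime,space_cache,user_subvol_rm_allowed,compress-force=lzo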
>
> Each node is a HP Microserver n36l or n54l, with 8GB of memory, so CPU
> horsepower is lacking somewhat. Ceph is version 0.80.8, and each node is
> also a mon.
>
> My issue is: After adding the 15th osd, the cluster went into a spiral
> of destruction, with osds going down one after another. One might go
> down on occasion, and usually a start of the osd in question will remedy
> things. This time, though, it hasn't, and the problem appears to have
> become worse and worse. I've tried starting osds, restarting whole
> hosts, to no avail. I've brought all osds back 'in' and set noup, nodown
> and noout. I've ceased rbd activity since it was getting blocked anyway.
> The cluster appears to now be 'stuck' in this state:
>
>     cluster e3dd7a1a-bd5f-43fe-a06f-58e830b93b7a
>      health HEALTH_WARN 1 pgs backfill; 45 pgs backfill_toofull; 1969
> pgs degraded; 1226 pgs down; 2 pgs incomplete; 1333 pgs peering; 1445
> pgs stale; 1336 pgs stuck inactive; 1445 pgs stuck stale; 4198 pgs stuck
> unclean; recovery 838948/2578420 objects degraded (32.537%); 2 near full
> osd(s); 8/15 in osds are down; noup,nodown,noout flag(s) set
>      monmap e5: 5 mons at
> {0=192.168.12.25:6789/0,1=192.168.12.26:6789/0,2=192.168.12.27:6789/0,3=
> 192.168.12.28:6789/0,4=192.168.12.29:6789/0}, election epoch 2618,
> quorum 0,1,2,3,4 0,1,2,3,4
>      osdmap e63276: 15 osds: 7 up, 15 in
>             flags noup,nodown,noout
>       pgmap v3371280: 4288 pgs, 5 pools, 3322 GB data, 835 kobjects
>             4611 GB used, 871 GB / 5563 GB avail
>             838948/2578420 objects degraded (32.537%)
>                    3 down+remapped+peering
>                    8 stale+active+degraded+remapped
>                   85 active+clean
>                    1 stale+incomplete
>                 1088 stale+down+peering
>                  642 active+degraded+remapped
>                    1 incomplete
>                   33 stale+remapped+peering
>                  135 down+peering
>                    1 stale+degraded
>                    1
> stale+active+degraded+remapped+wait_backfill+backfill_toofull
>                  854 active+remapped
>                  234 stale+active+degraded
>                    4 active+degraded+remapped+backfill_toofull
>                   40 active+remapped+backfill_toofull
>                 1079 active+degraded
>                    5 stale+active+clean
>                   74 stale+peering
>
> Take one of the nodes. It holds osds 12 (down & in), 13 (up & in) and 14
> (down & in).
>
> # ceph osd stat
>      osdmap e63276: 15 osds: 7 up, 15 in
>             flags noup,nodown,noout
>
> # ceph daemon osd.12 status
> no valid command found; 10 closest matches:
> config show
> help
> log dump
> get_command_descriptions
> git_version
> config set <var> <val> [<val>...]
> version
> 2
> config get <var>
> 0
> admin_socket: invalid command
>
> # ceph daemon osd.13 status
> { "cluster_fsid": "e3dd7a1a-bd5f-43fe-a06f-58e830b93b7a",
>   "osd_fsid": "d7794b10-2366-4c4f-bb4d-5f11098429b6",
>   "whoami": 13,
>   "state": "active",
>   "oldest_map": 48214,
>   "newest_map": 63276,
>   "num_pgs": 790}
>
> # ceph daemon osd.14 status
> admin_socket: exception getting command descriptions: [Errno 111]
> Connection refused
>
> I'm assuming osds 12 and 14 are acting that way because they're not up,
> but why are they different?

Well, you indicate below that osd.14's log says it crashed on an
internal heartbeat timeout (usually that means it got stuck waiting for
disk IO, or the kernel/btrfs hung), so that's why. The osd.12 process
exists but isn't "up"; osd.14 doesn't even have a socket to connect to.
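For example, a quick way to confirm that from the node itself (a sketch,
assuming the default admin socket path under /var/run/ceph):

    ls /var/run/ceph/ceph-osd.*.asok
    ceph --admin-daemon /var/run/ceph/ceph-osd.12.asok version

osd.12 should still answer on its socket (if only with the limited
command set you saw), while osd.14 won't have a socket at all once the
process has died.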

>
> In terms of logs, ceph-osd.12.log doesn't go beyond this:
> 2015-02-22 10:38:29.629407 7fd24952c780  0 ceph version 0.80.8
> (69eaad7f8308f21573c604f121956e64679a52a7), process ceph-osd, pid 3813
> 2015-02-22 10:38:29.639802 7fd24952c780  0
> filestore(/var/lib/ceph/osd/ceph-12) mount detected btrfs
> 2015-02-22 10:38:29.876993 7fd24952c780  0
> genericfilestorebackend(/var/lib/ceph/osd/ceph-12) detect_features:
> FIEMAP ioctl is supported and appears to work
> 2015-02-22 10:38:29.877028 7fd24952c780  0
> genericfilestorebackend(/var/lib/ceph/osd/ceph-12) detect_features:
> FIEMAP ioctl is disabled via 'filestore fiemap' config option
> 2015-02-22 10:38:31.487359 7fd24952c780  0
> genericfilestorebackend(/var/lib/ceph/osd/ceph-12) detect_features:
> syscall(SYS_syncfs, fd) fully supported
> 2015-02-22 10:38:31.487538 7fd24952c780  0
> btrfsfilestorebackend(/var/lib/ceph/osd/ceph-12) detect_feature:
> CLONE_RANGE ioctl is supported
> 2015-02-22 10:38:31.996129 7fd24952c780  0
> btrfsfilestorebackend(/var/lib/ceph/osd/ceph-12) detect_feature:
> SNAP_CREATE is supported
> 2015-02-22 10:38:32.016044 7fd24952c780  0
> btrfsfilestorebackend(/var/lib/ceph/osd/ceph-12) detect_feature:
> SNAP_DESTROY is supported
> 2015-02-22 10:38:32.016447 7fd24952c780  0
> btrfsfilestorebackend(/var/lib/ceph/osd/ceph-12) detect_feature:
> START_SYNC is supported (transid 388132)
> 2015-02-22 10:38:32.228338 7fd24952c780  0
> btrfsfilestorebackend(/var/lib/ceph/osd/ceph-12) detect_feature:
> WAIT_SYNC is supported
> 2015-02-22 10:38:32.315391 7fd24952c780  0
> btrfsfilestorebackend(/var/lib/ceph/osd/ceph-12) detect_feature:
> SNAP_CREATE_V2 is supported
> 2015-02-22 10:38:40.051466 7fd24952c780  0
> filestore(/var/lib/ceph/osd/ceph-12) mount: enabling WRITEAHEAD journal
> mode: checkpoint is not enabled
> 2015-02-22 10:38:48.814208 7fd24952c780  1 journal _open
> /var/lib/ceph/osd/ceph-12/journal fd 22: 5367660544 bytes, block size
> 4096 bytes, directio = 1, aio = 1
>
> ceph-osd.13.log seemed to cease earlier today. This osd is up though, so
> I'm guessing that's ok.
>
> ...
> 2015-02-22 10:39:59.743880 7f1a37eb3700 -1 osd.13 63267 heartbeat_check:
> no reply from osd.9 ever on either front or back, first ping sent
> 2015-02-22 10:39:12.431674 (cutoff 2015-02-22 10:39:39.743877)
> 2015-02-22 10:56:28.549016 7f1a139a0700  0 -- 192.168.12.29:6806/3462 >>
> 192.168.12.26:6806/3610 pipe(0x38d2f00 sd=86 :43837 s=2 pgs=15 cs=1 l=0
> c=0x45de89a0).fault with nothing to send, going to standby
> 2015-02-22 10:56:28.612828 7f1a1379e700  0 -- 192.168.12.29:6806/3462 >>
> 192.168.12.25:6801/3290 pipe(0x38d2a00 sd=83 :49115 s=2 pgs=73 cs=1 l=0
> c=0x45de82c0).fault with nothing to send, going to standby
> 2015-02-22 10:58:56.937399 7f1a197bd700  0 log [INF] : 4.4f1 deep-scrub
> ok
> 2015-02-22 10:59:31.122396 7f1a0f869700  0 -- 192.168.12.29:6806/3462 >>
> 192.168.12.26:6806/3610 pipe(0x449d6c80 sd=109 :6806 s=0 pgs=0 cs=0 l=0
> c=0x4610e160).accept connect_seq 2 vs existing 1 state standby
> 2015-02-22 11:00:46.476679 7f1a197bd700  0 log [INF] : 4.753 scrub ok
> 2015-02-22 11:13:03.189352 7f1a197bd700  0 log [INF] : 3.c2 scrub ok
> 2015-02-22 11:16:43.317444 7f1a0f869700  0 -- 192.168.12.29:6806/3462 >>
> 192.168.12.26:6806/3610 pipe(0x449d6c80 sd=109 :6806 s=2 pgs=93 cs=3 l=0
> c=0x45de89a0).fault with nothing to send, going to standby
> 2015-02-22 11:16:43.317591 7f1a1379e700  0 -- 192.168.12.29:6806/3462 >>
> 192.168.12.25:6801/3290 pipe(0x38d2a00 sd=83 :49210 s=2 pgs=93 cs=3 l=0
> c=0x45de82c0).fault with nothing to send, going to standby
> 2015-02-22 11:24:18.421633 7f1a197bd700  0 log [INF] : 3.12b scrub ok
>
> In ceph-osd.14.log, there were a lot of these earlier:
>
> 2015-02-22 10:41:00.643830 7fc4b2f2f700  0 -- 0.0.0.0:6801/3175 >>
> 192.168.12.27:6801/3471 pipe(0xa0822500 sd=102 :58233 s=1 pgs=0 cs=0 l=0
> c=0x2acf6f20).connect claims to be 192.168.12.27:6801/3457 not
> 192.168.12.27:6801/3471 - wrong node!
> 2015-02-22 10:41:00.643924 7fc4b2f2f700  0 -- 0.0.0.0:6801/3175 >>
> 192.168.12.27:6801/3471 pipe(0xa0822500 sd=102 :58233 s=1 pgs=0 cs=0 l=0
> c=0x2acf6f20).fault with nothing to send, going to standby
> 2015-02-22 10:41:03.680471 7fc4a26ca700  0 -- 0.0.0.0:6801/3175 >>
> 192.168.12.27:6806/3653 pipe(0xa0822500 sd=29 :44145 s=1 pgs=0 cs=0 l=0
> c=0x58cdb600).connect claims to be 192.168.12.27:6806/3656 not
> 192.168.12.27:6806/3653 - wrong node!
> 2015-02-22 10:41:03.680631 7fc4a26ca700  0 -- 0.0.0.0:6801/3175 >>
> 192.168.12.27:6806/3653 pipe(0xa0822500 sd=29 :44145 s=1 pgs=0 cs=0 l=0
> c=0x58cdb600).fault with nothing to send, going to standby
> 2015-02-22 10:41:03.683499 7fc4b3030700  0 -- 0.0.0.0:6801/3175 >>
> 192.168.12.26:6801/3331 pipe(0xa0822280 sd=31 :55061 s=1 pgs=0 cs=0 l=0
> c=0x4ebc71e0).connect claims to be 192.168.12.26:6801/3345 not
> 192.168.12.26:6801/3331 - wrong node!
> 2015-02-22 10:41:03.683576 7fc4b3030700  0 -- 0.0.0.0:6801/3175 >>
> 192.168.12.26:6801/3331 pipe(0xa0822280 sd=31 :55061 s=1 pgs=0 cs=0 l=0
> c=0x4ebc71e0).fault with nothing to send, going to standby
> 2015-02-22 10:41:03.685207 7fc4a25c9700  0 -- 0.0.0.0:6801/3175 >>
> 192.168.12.25:6801/3249 pipe(0xa0822c80 sd=32 :49182 s=1 pgs=0 cs=0 l=0
> c=0xa0589e40).connect claims to be 192.168.12.25:6801/3290 not
> 192.168.12.25:6801/3249 - wrong node!
> 2015-02-22 10:41:03.685246 7fc4a25c9700  0 -- 0.0.0.0:6801/3175 >>
> 192.168.12.25:6801/3249 pipe(0xa0822c80 sd=32 :49182 s=1 pgs=0 cs=0 l=0
> c=0xa0589e40).fault with nothing to send, going to standby
> 2015-02-22 10:41:04.089452 7fc4a17c1700  0 -- 0.0.0.0:6801/3175 >>
> 192.168.12.27:6801/3442 pipe(0x3b5f400 sd=110 :58240 s=1 pgs=0 cs=0 l=0
> c=0x69922b00).connect claims to be 192.168.12.27:6801/3457 not
> 192.168.12.27:6801/3442 - wrong node!
> 2015-02-22 10:41:04.090755 7fc4a17c1700  0 -- 0.0.0.0:6801/3175 >>
> 192.168.12.27:6801/3442 pipe(0x3b5f400 sd=110 :58240 s=1 pgs=0 cs=0 l=0
> c=0x69922b00).fault with nothing to send, going to standby
> 2015-02-22 10:41:16.373495 7fc4b2f2f700  0 -- 0.0.0.0:6801/3175 >>
> 192.168.12.28:6811/3880 pipe(0x3b5f680 sd=28 :43896 s=1 pgs=0 cs=0 l=0
> c=0x59128580).connect claims to be 0.0.0.0:6811/3825 not
> 192.168.12.28:6811/3880 - wrong node!
> 2015-02-22 10:41:16.373584 7fc4b2f2f700  0 -- 0.0.0.0:6801/3175 >>
> 192.168.12.28:6811/3880 pipe(0x3b5f680 sd=28 :43896 s=1 pgs=0 cs=0 l=0
> c=0x59128580).fault with nothing to send, going to standby
>
> Then this, more recently:
>
>     -3> 2015-02-22 15:41:09.738800 7fc4a6dd2700  5 osd.14 60983
> heartbeat: osd_stat(997 GB used, 826 GB avail, 1858 GB total, peers
> []/[] op hist [])
>     -2> 2015-02-22 15:41:09.959586 7fc4dc4b5700  1 heartbeat_map
> is_healthy 'FileStore::op_tp thread 0x7fc4cbc6f700' had timed out after
> 600
>     -1> 2015-02-22 15:41:09.959619 7fc4dc4b5700  1 heartbeat_map
> is_healthy 'FileStore::op_tp thread 0x7fc4cbc6f700' had suicide timed
> out after 18000
>      0> 2015-02-22 15:41:09.979187 7fc4dc4b5700 -1
> common/HeartbeatMap.cc: In function 'bool
> ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*,
> time_t)' thread 7fc4dc4b5700 time 2015-02-22 15:41:09.960169
> common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")
>
> I've upped 'filestore op thread suicide timeout' even higher now and
> rebooted all nodes to try to start the process again.
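>
> (A sketch of how that kind of override might look, using only the
> values already visible in the logs above (600s op thread timeout,
> 18000s suicide timeout): either in ceph.conf under [osd], or pushed to
> a still-running daemon via its admin socket 'config set' command:)
>
>     [osd]
>         filestore op thread timeout = 600
>         filestore op thread suicide timeout = 18000
>
>     ceph daemon osd.13 config set filestore_op_thread_suicide_timeout 18000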
>
> After a good 90 minutes, I'm seeing the likes of these on another node:
>
> osd 0: alternating between two messages?
>
> 2015-02-22 17:36:49.885502 7fafdf885700  1 heartbeat_map is_healthy
> 'FileStore::op_tp thread 0x7fafcf303700' had timed out after 600
> 2015-02-22 17:36:49.885532 7fafdf885700  1 heartbeat_map is_healthy
> 'FileStore::op_tp thread 0x7fafcfd04700' had timed out after 600
> 2015-02-22 17:36:54.885663 7fafdf885700  1 heartbeat_map is_healthy
> 'FileStore::op_tp thread 0x7fafcf303700' had timed out after 600
> 2015-02-22 17:36:54.885698 7fafdf885700  1 heartbeat_map is_healthy
> 'FileStore::op_tp thread 0x7fafcfd04700' had timed out after 600
> 2015-02-22 17:36:59.885766 7fafdf885700  1 heartbeat_map is_healthy
> 'FileStore::op_tp thread 0x7fafcf303700' had timed out after 600
> 2015-02-22 17:36:59.885787 7fafdf885700  1 heartbeat_map is_healthy
> 'FileStore::op_tp thread 0x7fafcfd04700' had timed out after 600
> 2015-02-22 17:37:04.885940 7fafdf885700  1 heartbeat_map is_healthy
> 'FileStore::op_tp thread 0x7fafcf303700' had timed out after 600
> 2015-02-22 17:37:04.885968 7fafdf885700  1 heartbeat_map is_healthy
> 'FileStore::op_tp thread 0x7fafcfd04700' had timed out after 600
> 2015-02-22 17:37:09.886072 7fafdf885700  1 heartbeat_map is_healthy
> 'FileStore::op_tp thread 0x7fafcf303700' had timed out after 600
> 2015-02-22 17:37:09.886103 7fafdf885700  1 heartbeat_map is_healthy
> 'FileStore::op_tp thread 0x7fafcfd04700' had timed out after 600

That's not what I'd expect to see (it appears to have timed out without
recognizing it?), but I don't look at these things too often, so maybe
that's the normal indication that heartbeats are failing.

>
> Osd 11: alternating between two messages?
>
> 2015-02-22 17:36:08.298415 7f983d949700  1 heartbeat_map is_healthy
> 'FileStore::op_tp thread 0x7f983098a700' had timed out after 600
> 2015-02-22 17:36:13.298487 7f983d949700  1 heartbeat_map is_healthy
> 'FileStore::op_tp thread 0x7f982ff89700' had timed out after 600
> 2015-02-22 17:36:13.298512 7f983d949700  1 heartbeat_map is_healthy
> 'FileStore::op_tp thread 0x7f983098a700' had timed out after 600
> 2015-02-22 17:36:18.298653 7f983d949700  1 heartbeat_map is_healthy
> 'FileStore::op_tp thread 0x7f982ff89700' had timed out after 600
> 2015-02-22 17:36:18.298740 7f983d949700  1 heartbeat_map is_healthy
> 'FileStore::op_tp thread 0x7f983098a700' had timed out after 600
> 2015-02-22 17:36:23.298808 7f983d949700  1 heartbeat_map is_healthy
> 'FileStore::op_tp thread 0x7f982ff89700' had timed out after 600
> 2015-02-22 17:36:23.298830 7f983d949700  1 heartbeat_map is_healthy
> 'FileStore::op_tp thread 0x7f983098a700' had timed out after 600
> 2015-02-22 17:36:28.298921 7f983d949700  1 heartbeat_map is_healthy
> 'FileStore::op_tp thread 0x7f982ff89700' had timed out after 600
> 2015-02-22 17:36:28.298949 7f983d949700  1 heartbeat_map is_healthy
> 'FileStore::op_tp thread 0x7f983098a700' had timed out after 600
> 2015-02-22 17:36:33.299003 7f983d949700  1 heartbeat_map is_healthy
> 'FileStore::op_tp thread 0x7f982ff89700' had timed out after 600
>
> And osd 3:
>
> 2015-02-22 16:05:35.309159 7fe6d4912700  0 -- 0.0.0.0:6801/3007 >>
> 192.168.12.29:6806/3462 pipe(0x3d1eef00 sd=112 :60063 s=1 pgs=0 cs=0 l=0
> c=0x5b689a20).connect claims to be 0.0.0.0:6806/3359 not
> 192.168.12.29:6806/3462 - wrong node!
> 2015-02-22 16:05:35.309245 7fe6d4912700  0 -- 0.0.0.0:6801/3007 >>
> 192.168.12.29:6806/3462 pipe(0x3d1eef00 sd=112 :60063 s=1 pgs=0 cs=0 l=0
> c=0x5b689a20).fault with nothing to send, going to standby
> 2015-02-22 16:05:40.045763 7fe6d4d26700  0 -- 0.0.0.0:6801/3007 >>
> 192.168.12.26:6806/3610 pipe(0x3d1eea00 sd=85 :58392 s=1 pgs=0 cs=0 l=0
> c=0x5b150dc0).connect claims to be 0.0.0.0:6806/3563 not
> 192.168.12.26:6806/3610 - wrong node!
> 2015-02-22 16:05:40.045857 7fe6d4d26700  0 -- 0.0.0.0:6801/3007 >>
> 192.168.12.26:6806/3610 pipe(0x3d1eea00 sd=85 :58392 s=1 pgs=0 cs=0 l=0
> c=0x5b150dc0).fault with nothing to send, going to standby
> 2015-02-22 16:05:43.003097 7fe6d4e27700  0 -- 0.0.0.0:6801/3007 >>
> 192.168.12.27:6801/3457 pipe(0x3d1ee500 sd=84 :45966 s=1 pgs=0 cs=0 l=0
> c=0x5b150b00).connect claims to be 0.0.0.0:6801/3047 not
> 192.168.12.27:6801/3457 - wrong node!
> 2015-02-22 16:05:43.003187 7fe6d4e27700  0 -- 0.0.0.0:6801/3007 >>
> 192.168.12.27:6801/3457 pipe(0x3d1ee500 sd=84 :45966 s=1 pgs=0 cs=0 l=0
> c=0x5b150b00).fault with nothing to send, going to standby
> 2015-02-22 16:06:12.223369 7fe6d4b24700  0 -- 0.0.0.0:6801/3007 >>
> 192.168.12.27:6806/3656 pipe(0x3d1eec80 sd=73 :41657 s=1 pgs=0 cs=0 l=0
> c=0x5b62eb00).connect claims to be 0.0.0.0:6806/3251 not
> 192.168.12.27:6806/3656 - wrong node!
> 2015-02-22 16:06:12.223455 7fe6d4b24700  0 -- 0.0.0.0:6801/3007 >>
> 192.168.12.27:6806/3656 pipe(0x3d1eec80 sd=73 :41657 s=1 pgs=0 cs=0 l=0
> c=0x5b62eb00).fault with nothing to send, going to standby
> 2015-02-22 16:07:50.036067 7fe6d4c25700  0 -- 0.0.0.0:6801/3007 >>
> 192.168.12.28:6811/3825 pipe(0x3d1ee780 sd=83 :40423 s=1 pgs=0 cs=0 l=0
> c=0x5b1511e0).connect claims to be 0.0.0.0:6811/3803 not
> 192.168.12.28:6811/3825 - wrong node!
> 2015-02-22 16:07:50.036124 7fe6d4c25700  0 -- 0.0.0.0:6801/3007 >>
> 192.168.12.28:6811/3825 pipe(0x3d1ee780 sd=83 :40423 s=1 pgs=0 cs=0 l=0
> c=0x5b1511e0).fault with nothing to send, going to standby

These are either peer connections behaving oddly (probably because of
the noup/nodown flags) or the messenger letting idle connections go to
standby.


>
> ----
>
> Currently, the nodes' CPUs certainly seem to be busy, but I don't
> appear to be seeing any changes in the status of the cluster. I've read about a
> similar problem a while back where the osds are lagging with their
> processing of maps, and they simply need to catch up. Is it likely that
> this is the same? Is there much I can do to help things along, or do I
> just need to be patient? CPU is being used still, after all, even if the
> output of 'ceph status' looks the same as it did an hour and a half ago
> when the machines were restarted.
>
> Update: since I'm having to try to send this email again, some time has
> passed, and the cluster nodes have been up about 21.5 hours. CPU usage
> is as approx as follows:
> 192.168.12.25           n54l    50%
> 192.168.12.26           n54l    1%
> 192.168.12.27           n36l    10%
> 192.168.12.28           n36l    1%
> 192.168.12.29           n54l    10%
>
> So one of the nodes is much more busy than the others. But is it
> actually doing anything?

That's hard to tell from what you're showing here. I'd probably unset
the noup flag and try to bring up one extra OSD process at a time on
each node. That will let them peer and get data where it's supposed to
be without generating maps every time one of the nodes crashes because
it's out of memory or the disk has gotten overloaded.
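Roughly something like the following, one OSD at a time per node (a
sketch; the start command depends on your init system, this assumes the
Ubuntu upstart job):

    ceph osd unset noup
    start ceph-osd id=12     # or: service ceph start osd.12
    ceph -w                  # watch peering settle before starting the next one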

Also, you want to find out why they're dying. That's probably the root
cause of your issues; I imagine the backfilling invoked by starting up
a brand-new OSD drove some of them over the edge.
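The OSD logs and the kernel log are the first places I'd check, e.g.
(assuming the default log locations):

    grep -i "suicide timeout" /var/log/ceph/ceph-osd.*.log
    dmesg | grep -iE "btrfs|blocked for more than"
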
-Greg
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
