This message seems to be very concerning:

> mds0: Metadata damage detected

but for the rest, the cluster seems still to be recovering. You could try to
speed things up with ceph tell, like:

ceph tell osd.* injectargs --osd_max_backfills=10
ceph tell osd.* injectargs --osd_recovery_sleep=0.0
ceph tell osd.* injectargs --osd_recovery_threads=2

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
Belo Horizonte - Brasil
IRC NICK - WebertRLZ

On Fri, May 11, 2018 at 3:06 PM Daniel Davidson <[email protected]> wrote:

> Below is the information you were asking for. I think they are size=2,
> min_size=1.
>
> Dan
>
> # ceph status
>     cluster 7bffce86-9d7b-4bdf-a9c9-67670e68ca77
>      health HEALTH_ERR
>             140 pgs are stuck inactive for more than 300 seconds
>             64 pgs backfill_wait
>             76 pgs backfilling
>             140 pgs degraded
>             140 pgs stuck degraded
>             140 pgs stuck inactive
>             140 pgs stuck unclean
>             140 pgs stuck undersized
>             140 pgs undersized
>             210 requests are blocked > 32 sec
>             recovery 38725029/695508092 objects degraded (5.568%)
>             recovery 10844554/695508092 objects misplaced (1.559%)
>             mds0: Metadata damage detected
>             mds0: Behind on trimming (71/30)
>             noscrub,nodeep-scrub flag(s) set
>      monmap e3: 4 mons at {ceph-0=172.16.31.1:6789/0,ceph-1=172.16.31.2:6789/0,ceph-2=172.16.31.3:6789/0,ceph-3=172.16.31.4:6789/0}
>             election epoch 824, quorum 0,1,2,3 ceph-0,ceph-1,ceph-2,ceph-3
>       fsmap e144928: 1/1/1 up {0=ceph-0=up:active}, 1 up:standby
>      osdmap e35814: 32 osds: 30 up, 30 in; 140 remapped pgs
>             flags noscrub,nodeep-scrub,sortbitwise,require_jewel_osds
>       pgmap v43142427: 1536 pgs, 2 pools, 762 TB data, 331 Mobjects
>             1444 TB used, 1011 TB / 2455 TB avail
>             38725029/695508092 objects degraded (5.568%)
>             10844554/695508092 objects misplaced (1.559%)
>                 1396 active+clean
>                   76 undersized+degraded+remapped+backfilling+peered
>                   64 undersized+degraded+remapped+wait_backfill+peered
>   recovery io 1244 MB/s, 1612 keys/s, 705 objects/s
>
> ID  WEIGHT     TYPE NAME        UP/DOWN REWEIGHT PRIMARY-AFFINITY
>  -1 2619.54541 root default
>  -2  163.72159     host ceph-0
>   0   81.86079         osd.0       up  1.00000          1.00000
>   1   81.86079         osd.1       up  1.00000          1.00000
>  -3  163.72159     host ceph-1
>   2   81.86079         osd.2       up  1.00000          1.00000
>   3   81.86079         osd.3       up  1.00000          1.00000
>  -4  163.72159     host ceph-2
>   8   81.86079         osd.8       up  1.00000          1.00000
>   9   81.86079         osd.9       up  1.00000          1.00000
>  -5  163.72159     host ceph-3
>  10   81.86079         osd.10      up  1.00000          1.00000
>  11   81.86079         osd.11      up  1.00000          1.00000
>  -6  163.72159     host ceph-4
>   4   81.86079         osd.4       up  1.00000          1.00000
>   5   81.86079         osd.5       up  1.00000          1.00000
>  -7  163.72159     host ceph-5
>   6   81.86079         osd.6       up  1.00000          1.00000
>   7   81.86079         osd.7       up  1.00000          1.00000
>  -8  163.72159     host ceph-6
>  12   81.86079         osd.12      up  0.79999          1.00000
>  13   81.86079         osd.13      up  1.00000          1.00000
>  -9  163.72159     host ceph-7
>  14   81.86079         osd.14      up  1.00000          1.00000
>  15   81.86079         osd.15      up  1.00000          1.00000
> -10  163.72159     host ceph-8
>  16   81.86079         osd.16      up  1.00000          1.00000
>  17   81.86079         osd.17      up  1.00000          1.00000
> -11  163.72159     host ceph-9
>  18   81.86079         osd.18      up  1.00000          1.00000
>  19   81.86079         osd.19      up  1.00000          1.00000
> -12  163.72159     host ceph-10
>  20   81.86079         osd.20      up  1.00000          1.00000
>  21   81.86079         osd.21      up  1.00000          1.00000
> -13  163.72159     host ceph-11
>  22   81.86079         osd.22      up  1.00000          1.00000
>  23   81.86079         osd.23      up  1.00000          1.00000
> -14  163.72159     host ceph-12
>  24   81.86079         osd.24      up  1.00000          1.00000
>  25   81.86079         osd.25      up  1.00000          1.00000
> -15  163.72159     host ceph-13
>  26   81.86079         osd.26    down        0          1.00000
>  27   81.86079         osd.27    down        0          1.00000
> -16  163.72159     host ceph-14
>  28   81.86079         osd.28      up  1.00000          1.00000
>  29   81.86079         osd.29      up  1.00000          1.00000
> -17  163.72159     host ceph-15
>  30   81.86079         osd.30      up  1.00000          1.00000
>  31   81.86079         osd.31      up  1.00000          1.00000
>
>
> On 05/11/2018 11:56 AM, David Turner wrote:
>
> What are some outputs of commands to show us the state of your cluster?
> Most notable is `ceph status`, but `ceph osd tree` would be helpful. What
> are the sizes of the pools in your cluster? Are they all size=3 min_size=2?
>
> On Fri, May 11, 2018 at 12:05 PM Daniel Davidson <[email protected]>
> wrote:
>
>> Hello,
>>
>> Today we had a node crash, and looking at it, it seems there is a
>> problem with the RAID controller, so it is not coming back up, maybe
>> ever. It corrupted the local filesystem for the ceph storage there.
>>
>> The remainder of our storage (10.2.10) cluster is running, and it looks
>> to be repairing, and our min_size is set to 2. Normally, I would expect
>> that the system would keep running normally from an end user
>> perspective when this happens, but the system is down. All mounts that
>> were up when this started look to be stale, and new mounts give the
>> following error:
>>
>> # mount -t ceph ceph-0:/ /test/ -o name=admin,secretfile=/etc/ceph/admin.secret,noatime,_netdev,rbytes
>> mount error 5 = Input/output error
>>
>> Any suggestions?
>>
>> Dan
>>
>> _______________________________________________
>> ceph-users mailing list
>> [email protected]
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
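The degraded and misplaced percentages in the quoted status can be sanity-checked directly from the object counts; this is plain arithmetic on the numbers reported above, nothing cluster-specific:

```python
# Object counts copied from the `ceph status` output quoted in this thread.
total = 695508092
degraded = 38725029
misplaced = 10844554

# ceph reports these ratios to three decimal places.
print(f"degraded:  {degraded / total * 100:.3f}%")   # matches the 5.568% in the status
print(f"misplaced: {misplaced / total * 100:.3f}%")  # matches the 1.559% in the status
```

With two of thirty-two OSDs down under size=2, roughly 1/16 of the data having only one remaining copy is consistent with the ~5.6% degraded figure.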
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
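The injectargs tuning suggested at the top of this message can be scripted as a loop; a minimal dry-run sketch (it only prints the commands so they can be reviewed first, and assumes the Jewel-era option names used in this thread):

```shell
#!/bin/sh
# Dry-run sketch of the recovery-tuning advice in this thread. The values
# (10 / 0.0 / 2) are the ones suggested above, not Ceph defaults. Remove
# the echo quoting and run the printed lines for real on an admin node.
for opt in --osd_max_backfills=10 --osd_recovery_sleep=0.0 --osd_recovery_threads=2
do
    echo "ceph tell osd.* injectargs $opt"
done
```

Note that values injected this way live only in the running daemons; they revert to whatever ceph.conf specifies when an OSD restarts.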
