This message seems to be very concerning:

> mds0: Metadata damage detected

but for the rest, the cluster seems still to be recovering. You could try to
speed things up with ceph tell, like:

ceph tell osd.* injectargs --osd_max_backfills=10
ceph tell osd.* injectargs --osd_recovery_sleep=0.0
ceph tell osd.* injectargs --osd_recovery_threads=2

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
Belo Horizonte - Brasil
IRC NICK - WebertRLZ

On Fri, May 11, 2018 at 3:06 PM Daniel Davidson <[email protected]> wrote:

> Below is the information you were asking for. I think they are size=2,
> min_size=1.
>
> Dan
>
> # ceph status
>     cluster 7bffce86-9d7b-4bdf-a9c9-67670e68ca77
>      health HEALTH_ERR
>             140 pgs are stuck inactive for more than 300 seconds
>             64 pgs backfill_wait
>             76 pgs backfilling
>             140 pgs degraded
>             140 pgs stuck degraded
>             140 pgs stuck inactive
>             140 pgs stuck unclean
>             140 pgs stuck undersized
>             140 pgs undersized
>             210 requests are blocked > 32 sec
>             recovery 38725029/695508092 objects degraded (5.568%)
>             recovery 10844554/695508092 objects misplaced (1.559%)
>             mds0: Metadata damage detected
>             mds0: Behind on trimming (71/30)
>             noscrub,nodeep-scrub flag(s) set
>      monmap e3: 4 mons at {ceph-0=172.16.31.1:6789/0,ceph-1=172.16.31.2:6789/0,ceph-2=172.16.31.3:6789/0,ceph-3=172.16.31.4:6789/0}
>             election epoch 824, quorum 0,1,2,3 ceph-0,ceph-1,ceph-2,ceph-3
>       fsmap e144928: 1/1/1 up {0=ceph-0=up:active}, 1 up:standby
>      osdmap e35814: 32 osds: 30 up, 30 in; 140 remapped pgs
>             flags noscrub,nodeep-scrub,sortbitwise,require_jewel_osds
>       pgmap v43142427: 1536 pgs, 2 pools, 762 TB data, 331 Mobjects
>             1444 TB used, 1011 TB / 2455 TB avail
>             38725029/695508092 objects degraded (5.568%)
>             10844554/695508092 objects misplaced (1.559%)
>                 1396 active+clean
>                   76 undersized+degraded+remapped+backfilling+peered
>                   64 undersized+degraded+remapped+wait_backfill+peered
>   recovery io 1244 MB/s, 1612 keys/s, 705 objects/s
>
> ID  WEIGHT     TYPE NAME        UP/DOWN REWEIGHT PRIMARY-AFFINITY
>  -1 2619.54541 root default
>  -2  163.72159     host ceph-0
>   0   81.86079         osd.0       up  1.00000          1.00000
>   1   81.86079         osd.1       up  1.00000          1.00000
>  -3  163.72159     host ceph-1
>   2   81.86079         osd.2       up  1.00000          1.00000
>   3   81.86079         osd.3       up  1.00000          1.00000
>  -4  163.72159     host ceph-2
>   8   81.86079         osd.8       up  1.00000          1.00000
>   9   81.86079         osd.9       up  1.00000          1.00000
>  -5  163.72159     host ceph-3
>  10   81.86079         osd.10      up  1.00000          1.00000
>  11   81.86079         osd.11      up  1.00000          1.00000
>  -6  163.72159     host ceph-4
>   4   81.86079         osd.4       up  1.00000          1.00000
>   5   81.86079         osd.5       up  1.00000          1.00000
>  -7  163.72159     host ceph-5
>   6   81.86079         osd.6       up  1.00000          1.00000
>   7   81.86079         osd.7       up  1.00000          1.00000
>  -8  163.72159     host ceph-6
>  12   81.86079         osd.12      up  0.79999          1.00000
>  13   81.86079         osd.13      up  1.00000          1.00000
>  -9  163.72159     host ceph-7
>  14   81.86079         osd.14      up  1.00000          1.00000
>  15   81.86079         osd.15      up  1.00000          1.00000
> -10  163.72159     host ceph-8
>  16   81.86079         osd.16      up  1.00000          1.00000
>  17   81.86079         osd.17      up  1.00000          1.00000
> -11  163.72159     host ceph-9
>  18   81.86079         osd.18      up  1.00000          1.00000
>  19   81.86079         osd.19      up  1.00000          1.00000
> -12  163.72159     host ceph-10
>  20   81.86079         osd.20      up  1.00000          1.00000
>  21   81.86079         osd.21      up  1.00000          1.00000
> -13  163.72159     host ceph-11
>  22   81.86079         osd.22      up  1.00000          1.00000
>  23   81.86079         osd.23      up  1.00000          1.00000
> -14  163.72159     host ceph-12
>  24   81.86079         osd.24      up  1.00000          1.00000
>  25   81.86079         osd.25      up  1.00000          1.00000
> -15  163.72159     host ceph-13
>  26   81.86079         osd.26    down        0          1.00000
>  27   81.86079         osd.27    down        0          1.00000
> -16  163.72159     host ceph-14
>  28   81.86079         osd.28      up  1.00000          1.00000
>  29   81.86079         osd.29      up  1.00000          1.00000
> -17  163.72159     host ceph-15
>  30   81.86079         osd.30      up  1.00000          1.00000
>  31   81.86079         osd.31      up  1.00000          1.00000
>
>
> On 05/11/2018 11:56 AM, David Turner wrote:
>
> What are some outputs of commands to show us the state of your cluster?
> Most notable is `ceph status`, but `ceph osd tree` would be helpful. What
> are the sizes of the pools in your cluster? Are they all size=3 min_size=2?
>
> On Fri, May 11, 2018 at 12:05 PM Daniel Davidson <[email protected]>
> wrote:
>
>> Hello,
>>
>> Today we had a node crash, and looking at it, it seems there is a
>> problem with the RAID controller, so it is not coming back up, maybe
>> ever. It corrupted the local filesystem for the ceph storage there.
>>
>> The remainder of our storage (10.2.10) cluster is running, and it looks
>> to be repairing, and our min_size is set to 2. Normally, I would expect
>> that the system would keep running normally from an end user
>> perspective when this happens, but the system is down. All mounts that
>> were up when this started look to be stale, and new mounts give the
>> following error:
>>
>> # mount -t ceph ceph-0:/ /test/ -o name=admin,secretfile=/etc/ceph/admin.secret,noatime,_netdev,rbytes
>> mount error 5 = Input/output error
>>
>> Any suggestions?
>>
>> Dan
>>
>> _______________________________________________
>> ceph-users mailing list
>> [email protected]
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
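The degraded and misplaced percentages in the quoted status can be sanity-checked directly from the object counts; this is plain arithmetic on the numbers reported above, nothing cluster-specific:

```python
# Object counts copied from the `ceph status` output quoted in this thread.
total = 695508092
degraded = 38725029
misplaced = 10844554

# ceph reports these ratios to three decimal places.
print(f"degraded:  {degraded / total * 100:.3f}%")   # matches the 5.568% in the status
print(f"misplaced: {misplaced / total * 100:.3f}%")  # matches the 1.559% in the status
```

With two of thirty-two OSDs down under size=2, roughly 1/16 of the data having only one remaining copy is consistent with the ~5.6% degraded figure.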
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
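The injectargs tuning suggested at the top of this message can be scripted as a loop; a minimal dry-run sketch (it only prints the commands so they can be reviewed first, and assumes the Jewel-era option names used in this thread):

```shell
#!/bin/sh
# Dry-run sketch of the recovery-tuning advice in this thread. The values
# (10 / 0.0 / 2) are the ones suggested above, not Ceph defaults. Remove
# the echo quoting and run the printed lines for real on an admin node.
for opt in --osd_max_backfills=10 --osd_recovery_sleep=0.0 --osd_recovery_threads=2
do
    echo "ceph tell osd.* injectargs $opt"
done
```

Note that values injected this way live only in the running daemons; they revert to whatever ceph.conf specifies when an OSD restarts.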
