In Luminous, is osd_recovery_threads now the equivalent of osd_disk_threads,
and osd_recovery_sleep the equivalent of osd_recovery_sleep_hdd?

Or is speeding up recovery handled quite differently in Luminous?

[@~]# ceph daemon osd.0 config show | grep osd | grep thread
    "osd_command_thread_suicide_timeout": "900",
    "osd_command_thread_timeout": "600",
    "osd_disk_thread_ioprio_class": "",
    "osd_disk_thread_ioprio_priority": "-1",
    "osd_disk_threads": "1",
    "osd_op_num_threads_per_shard": "0",
    "osd_op_num_threads_per_shard_hdd": "1",
    "osd_op_num_threads_per_shard_ssd": "2",
    "osd_op_thread_suicide_timeout": "150",
    "osd_op_thread_timeout": "15",
    "osd_peering_wq_threads": "2",
    "osd_recovery_thread_suicide_timeout": "300",
    "osd_recovery_thread_timeout": "30",
    "osd_remove_thread_suicide_timeout": "36000",
    "osd_remove_thread_timeout": "3600",

-----Original Message-----
From: Webert de Souza Lima [mailto:[email protected]] 
Sent: Friday, May 11, 2018 20:34
To: ceph-users
Subject: Re: [ceph-users] Node crash, filesystem not usable

This message seems very concerning:
 >            mds0: Metadata damage detected
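
If I remember correctly, you can list what the MDS has flagged with:

ceph tell mds.0 damage ls

(the output should be a list of damage-table entries, each with an id and
a damage type).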


As for the rest, the cluster still seems to be recovering. You could 
try to speed things up with ceph tell, like:

ceph tell osd.* injectargs --osd_max_backfills=10

ceph tell osd.* injectargs --osd_recovery_sleep=0.0

ceph tell osd.* injectargs --osd_recovery_threads=2
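
and, once recovery has caught up, set them back; these should be the
Jewel defaults, but please double-check:

ceph tell osd.* injectargs '--osd_max_backfills=1 --osd_recovery_sleep=0 --osd_recovery_threads=1'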



Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
Belo Horizonte - Brasil
IRC NICK - WebertRLZ


On Fri, May 11, 2018 at 3:06 PM Daniel Davidson 
<[email protected]> wrote:


        Below is the information you were asking for.  I think they are 
size=2, min_size=1.
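
        Either of these should confirm it on Jewel:

        # ceph osd pool ls detail
        # ceph osd dump | grep 'replicated size'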
        
        Dan
        
        # ceph status
            cluster 7bffce86-9d7b-4bdf-a9c9-67670e68ca77
             health HEALTH_ERR
                    140 pgs are stuck inactive for more than 300 seconds
                    64 pgs backfill_wait
                    76 pgs backfilling
                    140 pgs degraded
                    140 pgs stuck degraded
                    140 pgs stuck inactive
                    140 pgs stuck unclean
                    140 pgs stuck undersized
                    140 pgs undersized
                    210 requests are blocked > 32 sec
                    recovery 38725029/695508092 objects degraded (5.568%)
                    recovery 10844554/695508092 objects misplaced (1.559%)
                    mds0: Metadata damage detected
                    mds0: Behind on trimming (71/30)
                    noscrub,nodeep-scrub flag(s) set
             monmap e3: 4 mons at {ceph-0=172.16.31.1:6789/0,ceph-1=172.16.31.2:6789/0,ceph-2=172.16.31.3:6789/0,ceph-3=172.16.31.4:6789/0}
                    election epoch 824, quorum 0,1,2,3 ceph-0,ceph-1,ceph-2,ceph-3
              fsmap e144928: 1/1/1 up {0=ceph-0=up:active}, 1 up:standby
             osdmap e35814: 32 osds: 30 up, 30 in; 140 remapped pgs
                    flags noscrub,nodeep-scrub,sortbitwise,require_jewel_osds
              pgmap v43142427: 1536 pgs, 2 pools, 762 TB data, 331 Mobjects
                    1444 TB used, 1011 TB / 2455 TB avail
                    38725029/695508092 objects degraded (5.568%)
                    10844554/695508092 objects misplaced (1.559%)
                        1396 active+clean
                          76 undersized+degraded+remapped+backfilling+peered
                          64 undersized+degraded+remapped+wait_backfill+peered
        recovery io 1244 MB/s, 1612 keys/s, 705 objects/s
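
        (Sanity check: 76 backfilling + 64 backfill_wait = 140 PGs, matching
        the 140 stuck undersized/inactive counts above, and
        38725029/695508092 ≈ 5.568% degraded, as reported.)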
        
        ID  WEIGHT     TYPE NAME        UP/DOWN REWEIGHT PRIMARY-AFFINITY 
         -1 2619.54541 root default                                       
         -2  163.72159     host ceph-0                                    
          0   81.86079         osd.0         up  1.00000          1.00000 
          1   81.86079         osd.1         up  1.00000          1.00000 
         -3  163.72159     host ceph-1                                    
          2   81.86079         osd.2         up  1.00000          1.00000 
          3   81.86079         osd.3         up  1.00000          1.00000 
         -4  163.72159     host ceph-2                                    
          8   81.86079         osd.8         up  1.00000          1.00000 
          9   81.86079         osd.9         up  1.00000          1.00000 
         -5  163.72159     host ceph-3                                    
         10   81.86079         osd.10        up  1.00000          1.00000 
         11   81.86079         osd.11        up  1.00000          1.00000 
         -6  163.72159     host ceph-4                                    
          4   81.86079         osd.4         up  1.00000          1.00000 
          5   81.86079         osd.5         up  1.00000          1.00000 
         -7  163.72159     host ceph-5                                    
          6   81.86079         osd.6         up  1.00000          1.00000 
          7   81.86079         osd.7         up  1.00000          1.00000 
         -8  163.72159     host ceph-6                                    
         12   81.86079         osd.12        up  0.79999          1.00000 
         13   81.86079         osd.13        up  1.00000          1.00000 
         -9  163.72159     host ceph-7                                    
         14   81.86079         osd.14        up  1.00000          1.00000 
         15   81.86079         osd.15        up  1.00000          1.00000 
        -10  163.72159     host ceph-8                                    
         16   81.86079         osd.16        up  1.00000          1.00000 
         17   81.86079         osd.17        up  1.00000          1.00000 
        -11  163.72159     host ceph-9                                    
         18   81.86079         osd.18        up  1.00000          1.00000 
         19   81.86079         osd.19        up  1.00000          1.00000 
        -12  163.72159     host ceph-10                                   
         20   81.86079         osd.20        up  1.00000          1.00000 
         21   81.86079         osd.21        up  1.00000          1.00000 
        -13  163.72159     host ceph-11                                   
         22   81.86079         osd.22        up  1.00000          1.00000 
         23   81.86079         osd.23        up  1.00000          1.00000 
        -14  163.72159     host ceph-12                                   
         24   81.86079         osd.24        up  1.00000          1.00000 
         25   81.86079         osd.25        up  1.00000          1.00000 
        -15  163.72159     host ceph-13                                   
         26   81.86079         osd.26      down        0          1.00000 
         27   81.86079         osd.27      down        0          1.00000 
        -16  163.72159     host ceph-14                                   
         28   81.86079         osd.28        up  1.00000          1.00000 
         29   81.86079         osd.29        up  1.00000          1.00000 
        -17  163.72159     host ceph-15                                   
         30   81.86079         osd.30        up  1.00000          1.00000 
         31   81.86079         osd.31        up  1.00000          1.00000 
        
        
        
        On 05/11/2018 11:56 AM, David Turner wrote:
        

                Could you share the output of some commands to show us the state of your 
cluster?  Most notable is `ceph status`, but `ceph osd tree` would also be 
helpful.  What are the sizes of the pools in your cluster?  Are they all 
size=3, min_size=2?

                On Fri, May 11, 2018 at 12:05 PM Daniel Davidson 
<[email protected]> wrote:
                

                        Hello,
                        
                        Today we had a node crash, and looking at it, it seems there is a 
                        problem with the RAID controller, so it is not coming back up, maybe 
                        ever.  It corrupted the local filesystem for the ceph storage there.
                        
                        The remainder of our storage (10.2.10) cluster is running, and it looks 
                        to be repairing, and our min_size is set to 2.  Normally, I would expect 
                        that the system would keep running normally from an end-user 
                        perspective when this happens, but the system is down.  All mounts that 
                        were up when this started look to be stale, and new mounts give the 
                        following error:
                        
                        # mount -t ceph ceph-0:/ /test/ -o \
                          name=admin,secretfile=/etc/ceph/admin.secret,noatime,_netdev,rbytes
                        mount error 5 = Input/output error
                        
                        Any suggestions?
                        
                        Dan
                        


_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
