Could you post your crushmap? PGs mapping to no OSDs is a symptom of something 
wrong there.
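
For reference, one way to export it for posting (standard ceph and crushtool 
commands; the output paths are just examples):

    ceph osd getcrushmap -o /tmp/crushmap.bin
    crushtool -d /tmp/crushmap.bin -o /tmp/crushmap.txt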


You can stop the OSDs from changing position at startup with 'osd crush update 
on start = false':


http://docs.ceph.com/docs/master/rados/operations/crush-map/#crush-location
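
In ceph.conf that setting would look roughly like this (either globally in the 
[osd] section or per daemon):

    [osd]
    osd crush update on start = false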


Josh

Sent from Nine
________________________________
From: Jan-Willem Michels <[email protected]>
Sent: Sep 11, 2017 23:50
To: [email protected]
Subject: [ceph-users] Oops: lost cluster with: ceph osd require-osd-release 
luminous

> We have a Kraken cluster, newly built at the time, with BlueStore enabled. 
> It is 8 systems, each with 10 x 10 TB disks, and each machine also has one 
> 2 TB NVMe disk, 
> plus 3 monitors etc. 
> About 700 TB in total and 300 TB used. Mainly S3 object store. 
>
> Of course there is more to the story: we have one strange thing in our 
> cluster. 
> We tried to create two pools of storage, default and SSD, and created a 
> new crush rule for it. 
> It worked without problems for months. 
> But whenever we restarted a machine / NVMe OSD, it would "forget" that the 
> NVMe should belong to the SSD pool (for that particular machine). 
> Since we don't restart systems, we didn't notice that. 
> The NVMe would reappear in the default pool. 
> When we reapplied the same crush rule, it would go back to the SSD 
> pool. 
> All the while, data kept working on the NVMe disks. 
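>
> For illustration, reinjecting an edited crushmap is usually done along these 
> lines (the file names here are just examples): 
>
>      crushtool -c crushmap.txt -o crushmap.new 
>      ceph osd setcrushmap -i crushmap.new 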
>
> Clearly something is not ideal there. And Luminous has a different 
> approach to separating SSD from HDD. 
> So we thought we would first go to Luminous 12.2.0 and later see how to fix this. 
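>
> For illustration, that Luminous-style separation uses device classes and 
> class-aware crush rules, roughly along these lines (the class, rule and OSD 
> names are just examples; an OSD that already has a class would first need 
> rm-device-class): 
>
>      ceph osd crush set-device-class nvme osd.80 
>      ceph osd crush rule create-replicated fast default host nvme 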
>
> We did an upgrade to Luminous and that went well. That requires a reboot / 
> restart of the OSDs, so all NVMe devices were back in the default pool. 
> Reapplying the crush rule brought them back to the SSD pool. 
> Also, while doing the upgrade, we disabled in ceph.conf the setting 
> # enable experimental unrecoverable data corrupting features = bluestore 
> since in Luminous that is no longer needed. 
>
> Everything was working fine. 
> In ceph -s we had this health warning: 
>
>              all OSDs are running luminous or later but 
> require_osd_release < luminous 
>
> So I thought I would set the minimum OSD version to luminous with: 
>
> ceph osd require-osd-release luminous 
>
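> For reference, the value this changes can be checked with the osdmap dump, 
> before and after: 
>
>      ceph osd dump | grep require_osd_release 
>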
> To us that seemed to be nothing more than setting the minimum software 
> version required to connect to the cluster. 
> The system answered back: 
>
> recovery_deletes is set 
>
> and that was it; the same second, ceph -s went to "0": 
>
>   ceph -s 
>    cluster: 
>      id:     5bafad08-31b2-4716-be77-07ad2e2647eb 
>      health: HEALTH_WARN 
>              noout flag(s) set 
>              Reduced data availability: 3248 pgs inactive 
>              Degraded data redundancy: 3248 pgs unclean 
>
>    services: 
>      mon: 3 daemons, quorum Ceph-Mon1,Ceph-Mon2,Ceph-Mon3 
>      mgr: Ceph-Mon2(active), standbys: Ceph-Mon3, Ceph-Mon1 
>      osd: 88 osds: 88 up, 88 in; 297 remapped pgs 
>           flags noout 
>
>    data: 
>      pools:   26 pools, 3248 pgs 
>      objects: 0 objects, 0 bytes 
>      usage:   0 kB used, 0 kB / 0 kB avail 
>      pgs:     100.000% pgs unknown 
>               3248 unknown 
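>
> For reference, a quick way to see what the cluster currently maps a given PG 
> to is the pg map command (this PG id is taken from the health detail further 
> down): 
>
>      ceph pg map 15.7cd 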
>
> And before that it looked something like this. The errors you see (apart 
> from the scrub error) were from the upgrade / restarting, and I would expect 
> them to go away very quickly. 
>
> ceph -s 
>    cluster: 
>      id:     5bafad08-31b2-4716-be77-07ad2e2647eb 
>      health: HEALTH_ERR 
>              385 pgs backfill_wait 
>              5 pgs backfilling 
>              135 pgs degraded 
>              1 pgs inconsistent 
>              1 pgs peering 
>              4 pgs recovering 
>              131 pgs recovery_wait 
>              98 pgs stuck degraded 
>              525 pgs stuck unclean 
>              recovery 119/612465488 objects degraded (0.000%) 
>              recovery 24/612465488 objects misplaced (0.000%) 
>              1 scrub errors 
>              noout flag(s) set 
>              all OSDs are running luminous or later but 
> require_osd_release < luminous 
>
>    services: 
>      mon: 3 daemons, quorum Ceph-Mon1,Ceph-Mon2,Ceph-Mon3 
>      mgr: Ceph-Mon2(active), standbys: Ceph-Mon1, Ceph-Mon3 
>      osd: 88 osds: 88 up, 88 in; 387 remapped pgs 
>           flags noout 
>
>    data: 
>      pools:   26 pools, 3248 pgs 
>      objects: 87862k objects, 288 TB 
>      usage:   442 TB used, 300 TB / 742 TB avail 
>      pgs:     0.031% pgs not active 
>               119/612465488 objects degraded (0.000%) 
>               24/612465488 objects misplaced (0.000%) 
>               2720 active+clean 
>               385  active+remapped+backfill_wait 
>               131  active+recovery_wait+degraded 
>               5    active+remapped+backfilling 
>               4    active+recovering+degraded 
>               1    active+clean+inconsistent 
>               1    peering 
>               1    active+clean+scrubbing+deep 
>
>    io: 
>      client:   34264 B/s rd, 2091 kB/s wr, 38 op/s rd, 48 op/s wr 
>      recovery: 4235 kB/s, 6 objects/s 
>
> The current ceph health detail: 
>
> HEALTH_WARN noout flag(s) set; Reduced data availability: 3248 pgs 
> inactive; Degraded data redundancy: 3248 pgs unclean 
> OSDMAP_FLAGS noout flag(s) set 
> PG_AVAILABILITY Reduced data availability: 3248 pgs inactive 
>      pg 15.7cd is stuck inactive for 24780.157341, current state 
> unknown, last acting [] 
>      pg 15.7ce is stuck inactive for 24780.157341, current state 
> unknown, last acting [] 
>      pg 15.7cf is stuck inactive for 24780.157341, current state 
> unknown, last acting [] 
> .. 
>      pg 15.7ff is stuck inactive for 24728.059692, current state 
> unknown, last acting [] 
> PG_DEGRADED Degraded data redundancy: 3248 pgs unclean 
>      pg 15.7cd is stuck unclean for 24728.059692, current state unknown, 
> last acting [] 
>      pg 15.7ce is stuck unclean for 24728.059692, current state unknown, 
> last acting [] 
> .... 
>      pg 15.7fc is stuck unclean for 21892.783340, current state unknown, 
> last acting [] 
>      pg 15.7fd is stuck unclean for 21892.783340, current state unknown, 
> last acting [] 
>      pg 15.7fe is stuck unclean for 21892.783340, current state unknown, 
> last acting [] 
>      pg 15.7ff is stuck unclean for 21892.783340, current state unknown, 
> last acting [] 
>
>   ceph pg dump_stuck unclean | more 
> 15.46b  unknown []         -1     []             -1 
> 15.46a  unknown []         -1     []             -1 
> 15.469  unknown []         -1     []             -1 
> 15.468  unknown []         -1     []             -1 
> 15.467  unknown []         -1     []             -1 
> 15.466  unknown []         -1     []             -1 
> 15.465  unknown []         -1     []             -1 
> 15.464  unknown []         -1     []             -1 
> 15.463  unknown []         -1     []             -1 
> .... 
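>
> For illustration, one way to check whether the crushmap itself still produces 
> OSD mappings for these PGs is to run the exported map through crushtool (the 
> rule number and replica count are just examples): 
>
>      ceph osd getcrushmap -o /tmp/cm 
>      crushtool -i /tmp/cm --test --rule 0 --num-rep 3 --show-mappings | head 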
>
>
> Any ideas? 
>
> Greetings JW 
>
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
