Could you post your crushmap? PGs mapping to no OSDs is a symptom of something wrong there.
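For reference, a typical way to export and decompile it (the file names
here are just placeholders):

    # dump the binary crushmap, then decompile it to editable text
    ceph osd getcrushmap -o crushmap.bin
    crushtool -d crushmap.bin -o crushmap.txt

The decompiled crushmap.txt is the useful thing to post. The reverse
direction (crushtool -c to recompile, ceph osd setcrushmap -i to inject)
is how you'd re-apply an edited map.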
You can stop the OSDs from changing position at startup with 'osd crush update on start = false' (a minimal ceph.conf sketch follows after the quoted message below): http://docs.ceph.com/docs/master/rados/operations/crush-map/#crush-location

Josh

Sent from Nine

________________________________
From: Jan-Willem Michels <[email protected]>
Sent: Sep 11, 2017 23:50
To: [email protected]
Subject: [ceph-users] Oeps: lost cluster with: ceph osd require-osd-release luminous

> We have a Kraken cluster, newly built at the time, with BlueStore enabled.
> It is 8 systems with 10 x 10 TB disks each, and each machine also has
> one 2 TB NVMe disk, plus 3 monitors, etc.
> About 700 TB total, 300 TB used. Mainly an S3 object store.
>
> Of course there is more to the story: we have one strange thing in our
> cluster.
> We tried to create two pools of storage, default and ssd, and created a
> new crush rule.
> That worked without problems for months.
> But when we restarted a computer / NVMe OSD, it would "forget" that the
> NVMe should be connected to the SSD pool (for that particular computer).
> Since we don't restart systems, we didn't notice that.
> The NVMe would reappear in the default pool.
> When we re-applied the same crush rule, it would go back to the SSD pool.
> All the while, data kept being served from the NVMe disks.
>
> Clearly something is not ideal there. And Luminous has a different
> approach to separating SSD from HDD.
> So we thought: first go to Luminous 12.2.0, and later see how to fix this.
>
> We did an upgrade to Luminous and that went well. That requires a
> restart of the OSDs, so all NVMe devices were back at default.
> Reapplying the crush rule brought them back to the SSD pool.
> Also, while doing the upgrade, we commented out this line in ceph.conf:
>
>     # enable experimental unrecoverable data corrupting features = bluestore
>
> since in Luminous that is no longer needed.
>
> Everything was working fine.
> In ceph -s we had this health warning:
>
>     all OSDs are running luminous or later but
>     require_osd_release < luminous
>
> So I thought I would set the minimum OSD version to Luminous with:
>
>     ceph osd require-osd-release luminous
>
> To us that seemed nothing more than a minimum software version required
> to connect to the cluster.
> The system answered back:
>
>     recovery_deletes is set
>
> and that was it. The same second, ceph -s went to "0":
>
> ceph -s
>   cluster:
>     id:     5bafad08-31b2-4716-be77-07ad2e2647eb
>     health: HEALTH_WARN
>             noout flag(s) set
>             Reduced data availability: 3248 pgs inactive
>             Degraded data redundancy: 3248 pgs unclean
>
>   services:
>     mon: 3 daemons, quorum Ceph-Mon1,Ceph-Mon2,Ceph-Mon3
>     mgr: Ceph-Mon2(active), standbys: Ceph-Mon3, Ceph-Mon1
>     osd: 88 osds: 88 up, 88 in; 297 remapped pgs
>          flags noout
>
>   data:
>     pools:   26 pools, 3248 pgs
>     objects: 0 objects, 0 bytes
>     usage:   0 kB used, 0 kB / 0 kB avail
>     pgs:     100.000% pgs unknown
>              3248 unknown
>
> Before that, it was something like this. The errors you see (apart from
> the scrub error) were from the upgrade / restarting, and I would expect
> them to go away very quickly:
>
> ceph -s
>   cluster:
>     id:     5bafad08-31b2-4716-be77-07ad2e2647eb
>     health: HEALTH_ERR
>             385 pgs backfill_wait
>             5 pgs backfilling
>             135 pgs degraded
>             1 pgs inconsistent
>             1 pgs peering
>             4 pgs recovering
>             131 pgs recovery_wait
>             98 pgs stuck degraded
>             525 pgs stuck unclean
>             recovery 119/612465488 objects degraded (0.000%)
>             recovery 24/612465488 objects misplaced (0.000%)
>             1 scrub errors
>             noout flag(s) set
>             all OSDs are running luminous or later but
>             require_osd_release < luminous
>
>   services:
>     mon: 3 daemons, quorum Ceph-Mon1,Ceph-Mon2,Ceph-Mon3
>     mgr: Ceph-Mon2(active), standbys: Ceph-Mon1, Ceph-Mon3
>     osd: 88 osds: 88 up, 88 in; 387 remapped pgs
>          flags noout
>
>   data:
>     pools:   26 pools, 3248 pgs
>     objects: 87862k objects, 288 TB
>     usage:   442 TB used, 300 TB / 742 TB avail
>     pgs:     0.031% pgs not active
>              119/612465488 objects degraded (0.000%)
>              24/612465488 objects misplaced (0.000%)
>              2720 active+clean
>              385  active+remapped+backfill_wait
>              131  active+recovery_wait+degraded
>              5    active+remapped+backfilling
>              4    active+recovering+degraded
>              1    active+clean+inconsistent
>              1    peering
>              1    active+clean+scrubbing+deep
>
>   io:
>     client:   34264 B/s rd, 2091 kB/s wr, 38 op/s rd, 48 op/s wr
>     recovery: 4235 kB/s, 6 objects/s
>
> The current ceph health detail:
>
> HEALTH_WARN noout flag(s) set; Reduced data availability: 3248 pgs
> inactive; Degraded data redundancy: 3248 pgs unclean
> OSDMAP_FLAGS noout flag(s) set
> PG_AVAILABILITY Reduced data availability: 3248 pgs inactive
>     pg 15.7cd is stuck inactive for 24780.157341, current state
>         unknown, last acting []
>     pg 15.7ce is stuck inactive for 24780.157341, current state
>         unknown, last acting []
>     pg 15.7cf is stuck inactive for 24780.157341, current state
>         unknown, last acting []
>     ..
>     pg 15.7ff is stuck inactive for 24728.059692, current state
>         unknown, last acting []
> PG_DEGRADED Degraded data redundancy: 3248 pgs unclean
>     pg 15.7cd is stuck unclean for 24728.059692, current state
>         unknown, last acting []
>     pg 15.7ce is stuck unclean for 24728.059692, current state
>         unknown, last acting []
>     ....
>     pg 15.7fc is stuck unclean for 21892.783340, current state
>         unknown, last acting []
>     pg 15.7fd is stuck unclean for 21892.783340, current state
>         unknown, last acting []
>     pg 15.7fe is stuck unclean for 21892.783340, current state
>         unknown, last acting []
>     pg 15.7ff is stuck unclean for 21892.783340, current state
>         unknown, last acting []
>
> ceph pg dump_stuck unclean | more
> 15.46b    unknown    []    -1    []    -1
> 15.46a    unknown    []    -1    []    -1
> 15.469    unknown    []    -1    []    -1
> 15.468    unknown    []    -1    []    -1
> 15.467    unknown    []    -1    []    -1
> 15.466    unknown    []    -1    []    -1
> 15.465    unknown    []    -1    []    -1
> 15.464    unknown    []    -1    []    -1
> 15.463    unknown    []    -1    []    -1
> ....
>
> Any ideas?
>
> Greetings, JW
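Picking up the 'osd crush update on start = false' suggestion from the
top of this mail, a minimal ceph.conf sketch (putting it under [osd] is
an assumption on my part; [global] works as well):

    [osd]
    # don't let an OSD set its own crush location when it starts,
    # so a restarted OSD stays wherever the crushmap puts it
    osd crush update on start = false

The flip side is that new OSDs then have to be placed in the crushmap by
hand, e.g. with 'ceph osd crush add' or 'ceph osd crush move'.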
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
