Re: [ceph-users] Adding disks -> getting unfound objects [Luminous]
Hey Burkhard,

we did actually restart osd.61, which led to the current status.

Best,

Nico

Burkhard Linke writes:

> On 01/23/2018 08:54 AM, Nico Schottelius wrote:
>> Good morning,
>>
>> the osd.61 actually just crashed and the disk is still intact. However,
>> after 8 hours of rebuilding, the unfound objects are still missing:
>
> *snipsnap*
>
>> Is there any chance to recover those pgs or did we actually lose data
>> with a 2 disk failure?
>>
>> And is there any way out of this besides going with
>>
>> ceph pg {pg-id} mark_unfound_lost revert|delete
>>
>> ?
>
> Just my 2 cents:
>
> If the disk is still intact and the data is still readable, you can try
> to export the pg content with ceph-objectstore-tool, and import it into
> another OSD.
>
> On the other hand: if the disk is still intact, just restart the OSD?

--
Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch
Re: [ceph-users] Adding disks -> getting unfound objects [Luminous]
Hi,

On 01/23/2018 08:54 AM, Nico Schottelius wrote:
> Good morning,
>
> the osd.61 actually just crashed and the disk is still intact. However,
> after 8 hours of rebuilding, the unfound objects are still missing:

*snipsnap*

> Is there any chance to recover those pgs or did we actually lose data
> with a 2 disk failure?
>
> And is there any way out of this besides going with
>
> ceph pg {pg-id} mark_unfound_lost revert|delete
>
> ?

Just my 2 cents:

If the disk is still intact and the data is still readable, you can try
to export the pg content with ceph-objectstore-tool, and import it into
another OSD.

On the other hand: if the disk is still intact, just restart the OSD?

Regards,
Burkhard
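A minimal sketch of that export/import path, with PG 4.2a and osd.60 as
placeholder ids (on Luminous both OSDs must be stopped first, and
FileStore OSDs additionally need --journal-path):

    # stop both OSDs so their object stores are quiescent
    systemctl stop ceph-osd@61 ceph-osd@60

    # export the PG from the intact disk
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-61 \
        --pgid 4.2a --op export --file /root/pg-4.2a.export

    # import it into the other OSD, then bring both back up
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-60 \
        --op import --file /root/pg-4.2a.export
    systemctl start ceph-osd@60 ceph-osd@61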
Re: [ceph-users] Adding disks -> getting unfound objects [Luminous]
... while trying to locate which VMs are potentially affected by a
revert/delete, we noticed that

    root@server1:~# rados -p one-hdd ls

hangs. Where does ceph store the index of block devices found in a pool?
And is it possible that this information is in one of the damaged pgs?

Nico

Nico Schottelius writes:

> Good morning,
>
> the osd.61 actually just crashed and the disk is still intact. However,
> after 8 hours of rebuilding, the unfound objects are still missing:
>
> root@server1:~# ceph -s
>   cluster:
>     id:     26c0c5a8-d7ce-49ac-b5a7-bfd9d0ba81ab
>     health: HEALTH_WARN
>             noscrub,nodeep-scrub flag(s) set
>             111436/3017766 objects misplaced (3.693%)
>             9377/1005922 objects unfound (0.932%)
>             Reduced data availability: 84 pgs inactive
>             Degraded data redundancy: 277034/3017766 objects degraded
>             (9.180%), 84 pgs unclean, 84 pgs degraded, 84 pgs undersized
>             mon server2 is low on available space
>
>   services:
>     mon: 3 daemons, quorum server5,server3,server2
>     mgr: server5(active), standbys: server2, 2, 0, server3
>     osd: 54 osds: 54 up, 54 in; 84 remapped pgs
>          flags noscrub,nodeep-scrub
>
>   data:
>     pools:   3 pools, 1344 pgs
>     objects: 982k objects, 3837 GB
>     usage:   10618 GB used, 39030 GB / 49648 GB avail
>     pgs:     6.250% pgs not active
>              277034/3017766 objects degraded (9.180%)
>              111436/3017766 objects misplaced (3.693%)
>              9377/1005922 objects unfound (0.932%)
>              1260 active+clean
>              84   recovery_wait+undersized+degraded+remapped+peered
>
>   io:
>     client: 68960 B/s rd, 20722 kB/s wr, 12 op/s rd, 77 op/s wr
>
> We tried restarting osd.61, but ceph health detail does not change
> anymore:
>
> HEALTH_WARN noscrub,nodeep-scrub flag(s) set; 111436/3017886 objects
> misplaced (3.693%); 9377/1005962 objects unfound (0.932%); Reduced data
> availability: 84 pgs inactive; Degraded data redundancy: 277034/3017886
> objects degraded (9.180%), 84 pgs unclean, 84 pgs degraded, 84 pgs
> undersized; mon server2 is low on available space
> OSDMAP_FLAGS noscrub,nodeep-scrub flag(s) set
> OBJECT_MISPLACED 111436/3017886 objects misplaced (3.693%)
> OBJECT_UNFOUND 9377/1005962 objects unfound (0.932%)
>     pg 4.fa has 117 unfound objects
>     pg 4.ff has 107 unfound objects
>     pg 4.fd has 113 unfound objects
>     ...
>     pg 4.2a has 108 unfound objects
> PG_AVAILABILITY Reduced data availability: 84 pgs inactive
>     pg 4.2a is stuck inactive for 64117.189552, current state
>     recovery_wait+undersized+degraded+remapped+peered, last acting [61]
>     pg 4.31 is stuck inactive for 64117.147636, current state
>     recovery_wait+undersized+degraded+remapped+peered, last acting [61]
>     pg 4.32 is stuck inactive for 64117.178461, current state
>     recovery_wait+undersized+degraded+remapped+peered, last acting [61]
>     pg 4.34 is stuck inactive for 64117.150475, current state
>     recovery_wait+undersized+degraded+remapped+peered, last acting [61]
>     ...
> PG_DEGRADED Degraded data redundancy: 277034/3017886 objects degraded
> (9.180%), 84 pgs unclean, 84 pgs degraded, 84 pgs undersized
>     pg 4.2a is stuck unclean for 131612.984555, current state
>     recovery_wait+undersized+degraded+remapped+peered, last acting [61]
>     pg 4.31 is stuck undersized for 221.568468, current state
>     recovery_wait+undersized+degraded+remapped+peered, last acting [61]
>
> Is there any chance to recover those pgs or did we actually lose data
> with a 2 disk failure?
>
> And is there any way out of this besides going with
>
> ceph pg {pg-id} mark_unfound_lost revert|delete
>
> ?
>
> Best,
>
> Nico
>
> p.s.: the ceph 4.2a query:
>
> {
>     "state": "recovery_wait+undersized+degraded+remapped+peered",
>     "snap_trimq": "[]",
>     "epoch": 17879,
>     "up": [17, 13, 25],
>     "acting": [61],
>     "backfill_targets": ["13", "17", "25"],
>     "actingbackfill": ["13", "17", "25", "61"],
>     "info": {
>         "pgid": "4.2a",
>         "last_update": "17529'53875",
>         "last_complete": "17217'45447",
>         "log_tail": "17090'43812",
>         "last_user_version": 53875,
>         "last_backfill": "MAX",
>         "last_backfill_bitwise": 0,
>         "purged_snaps": [
>             {"start": "1", "length": "3"},
>             {"start": "6", "length": "8"},
>             {"start": "10", "length": "2"}
>         ],
>         "history": {
>             "epoch_created": 9134,
>             "epoch_pool_created": 9134,
>             "last_epoch_started": 17528,
>             "last_interval_started": 17527,
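For what it's worth, rados ls enumerates the pool PG by PG, so it will
block on the inactive PGs, which would explain the hang; the index of
RBD block devices, on the other hand, is the rbd_directory object stored
in the pool itself. A sketch for checking where that index lives and for
mapping an unfound object back to an image (assuming one-hdd is an RBD
pool; the image name is an example):

    # which PG, and which OSDs, hold the RBD image index of this pool?
    ceph osd map one-hdd rbd_directory

    # unfound data objects are named rbd_data.<image-id>.<offset>;
    # match <image-id> against an image's block_name_prefix
    rbd info one-hdd/one-123 | grep block_name_prefix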
Re: [ceph-users] Adding disks -> getting unfound objects [Luminous]
Good morning,

the osd.61 actually just crashed and the disk is still intact. However,
after 8 hours of rebuilding, the unfound objects are still missing:

root@server1:~# ceph -s
  cluster:
    id:     26c0c5a8-d7ce-49ac-b5a7-bfd9d0ba81ab
    health: HEALTH_WARN
            noscrub,nodeep-scrub flag(s) set
            111436/3017766 objects misplaced (3.693%)
            9377/1005922 objects unfound (0.932%)
            Reduced data availability: 84 pgs inactive
            Degraded data redundancy: 277034/3017766 objects degraded
            (9.180%), 84 pgs unclean, 84 pgs degraded, 84 pgs undersized
            mon server2 is low on available space

  services:
    mon: 3 daemons, quorum server5,server3,server2
    mgr: server5(active), standbys: server2, 2, 0, server3
    osd: 54 osds: 54 up, 54 in; 84 remapped pgs
         flags noscrub,nodeep-scrub

  data:
    pools:   3 pools, 1344 pgs
    objects: 982k objects, 3837 GB
    usage:   10618 GB used, 39030 GB / 49648 GB avail
    pgs:     6.250% pgs not active
             277034/3017766 objects degraded (9.180%)
             111436/3017766 objects misplaced (3.693%)
             9377/1005922 objects unfound (0.932%)
             1260 active+clean
             84   recovery_wait+undersized+degraded+remapped+peered

  io:
    client: 68960 B/s rd, 20722 kB/s wr, 12 op/s rd, 77 op/s wr

We tried restarting osd.61, but ceph health detail does not change
anymore:

HEALTH_WARN noscrub,nodeep-scrub flag(s) set; 111436/3017886 objects
misplaced (3.693%); 9377/1005962 objects unfound (0.932%); Reduced data
availability: 84 pgs inactive; Degraded data redundancy: 277034/3017886
objects degraded (9.180%), 84 pgs unclean, 84 pgs degraded, 84 pgs
undersized; mon server2 is low on available space
OSDMAP_FLAGS noscrub,nodeep-scrub flag(s) set
OBJECT_MISPLACED 111436/3017886 objects misplaced (3.693%)
OBJECT_UNFOUND 9377/1005962 objects unfound (0.932%)
    pg 4.fa has 117 unfound objects
    pg 4.ff has 107 unfound objects
    pg 4.fd has 113 unfound objects
    ...
    pg 4.2a has 108 unfound objects
PG_AVAILABILITY Reduced data availability: 84 pgs inactive
    pg 4.2a is stuck inactive for 64117.189552, current state
    recovery_wait+undersized+degraded+remapped+peered, last acting [61]
    pg 4.31 is stuck inactive for 64117.147636, current state
    recovery_wait+undersized+degraded+remapped+peered, last acting [61]
    pg 4.32 is stuck inactive for 64117.178461, current state
    recovery_wait+undersized+degraded+remapped+peered, last acting [61]
    pg 4.34 is stuck inactive for 64117.150475, current state
    recovery_wait+undersized+degraded+remapped+peered, last acting [61]
    ...
PG_DEGRADED Degraded data redundancy: 277034/3017886 objects degraded
(9.180%), 84 pgs unclean, 84 pgs degraded, 84 pgs undersized
    pg 4.2a is stuck unclean for 131612.984555, current state
    recovery_wait+undersized+degraded+remapped+peered, last acting [61]
    pg 4.31 is stuck undersized for 221.568468, current state
    recovery_wait+undersized+degraded+remapped+peered, last acting [61]

Is there any chance to recover those pgs or did we actually lose data
with a 2 disk failure?

And is there any way out of this besides going with

ceph pg {pg-id} mark_unfound_lost revert|delete

?

Best,

Nico

p.s.: the ceph 4.2a query:

{
    "state": "recovery_wait+undersized+degraded+remapped+peered",
    "snap_trimq": "[]",
    "epoch": 17879,
    "up": [17, 13, 25],
    "acting": [61],
    "backfill_targets": ["13", "17", "25"],
    "actingbackfill": ["13", "17", "25", "61"],
    "info": {
        "pgid": "4.2a",
        "last_update": "17529'53875",
        "last_complete": "17217'45447",
        "log_tail": "17090'43812",
        "last_user_version": 53875,
        "last_backfill": "MAX",
        "last_backfill_bitwise": 0,
        "purged_snaps": [
            {"start": "1", "length": "3"},
            {"start": "6", "length": "8"},
            {"start": "10", "length": "2"}
        ],
        "history": {
            "epoch_created": 9134,
            "epoch_pool_created": 9134,
            "last_epoch_started": 17528,
            "last_interval_started": 17527,
            "last_epoch_clean": 17079,
            "last_interval_clean": 17078,
            "last_epoch_split": 0,
            "last_epoch_marked_full": 0,
            "same_up_since": 17143,
            "same_interval_since": 17878,
            "same_primary_since": 17878,
            "last_scrub": "17090'44622",
            "last_scrub_stamp": "2018-01-21 09:37:09.888508",
            "last_deep_scrub": "17090'42219",
            "last_deep_scrub_stamp": "2018-01-20 05:05:45.372052",
            "last_clean_scrub_stamp": "2018-01-21 09:37:09.888508"
        },
        "stats": {
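Before deciding on revert or delete, the unfound objects of a PG can be
enumerated, which makes it easier to judge what would actually be
discarded. A sketch for one of the PGs named above:

    # list the unfound objects of PG 4.2a (paged; "more" marks further entries)
    ceph pg 4.2a list_missing

    # only as a last resort:
    #   revert - roll back to the previous version (new objects are forgotten)
    #   delete - forget the unfound objects entirely
    ceph pg 4.2a mark_unfound_lost revert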
Re: [ceph-users] Adding disks -> getting unfound objects [Luminous]
Weight the remaining disks you added to 0.0. They seem to be a bad
batch. This will start moving their data off of them and back onto the
rest of the cluster.

I generally suggest not to add more storage than you can afford to lose,
unless you trust your burn-in process. So if you have a host failure
domain and size=3, I wouldn't add storage in more than 2 nodes at a time
in case the disks die. That way you are much less likely to have scares.
I assume this disk was in a third node, leaving you with 3 failed disks
across 3 hosts?

It doesn't seem like these drives are going to work out, and I would
immediately weight all newly added disks to 0.0, get back to a point
where you are no longer backfilling/recovering PGs, and see where things
are at from there.

On Mon, Jan 22, 2018 at 1:33 PM Nico Schottelius <
nico.schottel...@ungleich.ch> wrote:

>
> While writing, yet another disk (osd.61 now) died and now we have
> 172 pgs down:
>
> [19:32:35] server2:~# ceph -s
>   cluster:
>     id:     26c0c5a8-d7ce-49ac-b5a7-bfd9d0ba81ab
>     health: HEALTH_WARN
>             noscrub,nodeep-scrub flag(s) set
>             21033/2263701 objects misplaced (0.929%)
>             Reduced data availability: 186 pgs inactive, 172 pgs down
>             Degraded data redundancy: 67370/2263701 objects degraded
>             (2.976%), 219 pgs unclean, 46 pgs degraded, 46 pgs undersized
>             mon server2 is low on available space
>
>   services:
>     mon: 3 daemons, quorum server5,server3,server2
>     mgr: server5(active), standbys: server2, 2, 0, server3
>     osd: 54 osds: 53 up, 53 in; 47 remapped pgs
>          flags noscrub,nodeep-scrub
>
>   data:
>     pools:   3 pools, 1344 pgs
>     objects: 736k objects, 2889 GB
>     usage:   8517 GB used, 36474 GB / 44991 GB avail
>     pgs:     13.839% pgs not active
>              67370/2263701 objects degraded (2.976%)
>              21033/2263701 objects misplaced (0.929%)
>              1125 active+clean
>              172  down
>              26   active+undersized+degraded+remapped+backfilling
>              14   undersized+degraded+remapped+backfilling+peered
>              6    active+undersized+degraded+remapped+backfill_wait
>              1    active+remapped+backfill_wait
>
>   io:
>     client:   835 kB/s rd, 262 kB/s wr, 16 op/s rd, 25 op/s wr
>     recovery: 102 MB/s, 26 objects/s
>
> What is the most sensible way to get out of this situation?
>
> David Turner writes:
>
> > I do remember seeing that exactly. As the number of recovery_wait pgs
> > decreased, the number of unfound objects decreased until they were all
> > found. Unfortunately it blocked some IO from happening during the
> > recovery, but in the long run we ended up with full data integrity
> > again.
> >
> > On Mon, Jan 22, 2018 at 1:03 PM Nico Schottelius <
> > nico.schottel...@ungleich.ch> wrote:
> >
> >>
> >> Hey David,
> >>
> >> thanks for the fast answer. All our pools are running with size=3,
> >> min_size=2 and the two disks were in 2 different hosts.
> >>
> >> What I am a bit worried about is the output of "ceph pg 4.fa query"
> >> (see below) that indicates that ceph already queried all other hosts
> >> and did not find the data anywhere.
> >>
> >> Do you remember having seen something similar?
> >>
> >> Best,
> >>
> >> Nico
> >>
> >> David Turner writes:
> >>
> >> > I have had the same problem before with unfound objects that
> >> > happened while backfilling after losing a drive. We didn't lose
> >> > drives outside of the failure domains and ultimately didn't lose
> >> > any data, but we did have to wait until after all of the PGs in
> >> > recovery_wait state were caught up. So if the 2 disks you lost
> >> > were in the same host and your CRUSH rules are set so that you can
> >> > lose a host without losing data, then the cluster will likely find
> >> > all of the objects by the time it's done backfilling. With only
> >> > losing 2 disks, I wouldn't worry about the missing objects not
> >> > becoming found unless you're pool size=2.
> >> >
> >> > On Mon, Jan 22, 2018 at 11:47 AM Nico Schottelius <
> >> > nico.schottel...@ungleich.ch> wrote:
> >> >
> >> >>
> >> >> Hello,
> >> >>
> >> >> we added about 7 new disks yesterday/today and our cluster became
> >> >> very slow. While the rebalancing took place, 2 of the 7 newly
> >> >> added disks died.
> >> >>
> >> >> Our cluster is still recovering, however we spotted that there
> >> >> are a lot of unfound objects.
> >> >>
> >> >> We lost osd.63 and osd.64, which do not seem to be involved in
> >> >> the sample pg that has unfound objects.
> >> >>
> >> >> We were wondering why there are unfound objects, where they are
> >> >> coming from and if there is a way to recover them?
> >> >>
> >> >> Any help appreciated,
> >> >>
> >> >> Best,
> >> >>
> >> >> Nico
> >> >>
> >> >>
> >> >> Our status is:
> >> >>
> >> >>   cluster:
> >> >>     id:     26c0c5a8-d7ce-49ac-b5a7-bfd9d0ba81ab
> >> >>     health: HEALTH_WARN
> >>
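A sketch of the draining David describes; the OSD ids here are
placeholders for whatever the remaining newly added disks are in this
cluster:

    # CRUSH weight 0 makes the cluster move the PGs off these disks
    for id in 58 59 60; do
        ceph osd crush reweight osd.$id 0.0
    done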
Re: [ceph-users] Adding disks -> getting unfound objects [Luminous]
While writing, yet another disk (osd.61 now) died and now we have
172 pgs down:

[19:32:35] server2:~# ceph -s
  cluster:
    id:     26c0c5a8-d7ce-49ac-b5a7-bfd9d0ba81ab
    health: HEALTH_WARN
            noscrub,nodeep-scrub flag(s) set
            21033/2263701 objects misplaced (0.929%)
            Reduced data availability: 186 pgs inactive, 172 pgs down
            Degraded data redundancy: 67370/2263701 objects degraded
            (2.976%), 219 pgs unclean, 46 pgs degraded, 46 pgs undersized
            mon server2 is low on available space

  services:
    mon: 3 daemons, quorum server5,server3,server2
    mgr: server5(active), standbys: server2, 2, 0, server3
    osd: 54 osds: 53 up, 53 in; 47 remapped pgs
         flags noscrub,nodeep-scrub

  data:
    pools:   3 pools, 1344 pgs
    objects: 736k objects, 2889 GB
    usage:   8517 GB used, 36474 GB / 44991 GB avail
    pgs:     13.839% pgs not active
             67370/2263701 objects degraded (2.976%)
             21033/2263701 objects misplaced (0.929%)
             1125 active+clean
             172  down
             26   active+undersized+degraded+remapped+backfilling
             14   undersized+degraded+remapped+backfilling+peered
             6    active+undersized+degraded+remapped+backfill_wait
             1    active+remapped+backfill_wait

  io:
    client:   835 kB/s rd, 262 kB/s wr, 16 op/s rd, 25 op/s wr
    recovery: 102 MB/s, 26 objects/s

What is the most sensible way to get out of this situation?

David Turner writes:

> I do remember seeing that exactly. As the number of recovery_wait pgs
> decreased, the number of unfound objects decreased until they were all
> found. Unfortunately it blocked some IO from happening during the
> recovery, but in the long run we ended up with full data integrity again.
>
> On Mon, Jan 22, 2018 at 1:03 PM Nico Schottelius <
> nico.schottel...@ungleich.ch> wrote:
>
>>
>> Hey David,
>>
>> thanks for the fast answer. All our pools are running with size=3,
>> min_size=2 and the two disks were in 2 different hosts.
>>
>> What I am a bit worried about is the output of "ceph pg 4.fa query"
>> (see below) that indicates that ceph already queried all other hosts
>> and did not find the data anywhere.
>>
>> Do you remember having seen something similar?
>>
>> Best,
>>
>> Nico
>>
>> David Turner writes:
>>
>> > I have had the same problem before with unfound objects that happened
>> > while backfilling after losing a drive. We didn't lose drives outside
>> > of the failure domains and ultimately didn't lose any data, but we
>> > did have to wait until after all of the PGs in recovery_wait state
>> > were caught up. So if the 2 disks you lost were in the same host and
>> > your CRUSH rules are set so that you can lose a host without losing
>> > data, then the cluster will likely find all of the objects by the
>> > time it's done backfilling. With only losing 2 disks, I wouldn't
>> > worry about the missing objects not becoming found unless you're
>> > pool size=2.
>> >
>> > On Mon, Jan 22, 2018 at 11:47 AM Nico Schottelius <
>> > nico.schottel...@ungleich.ch> wrote:
>> >
>> >>
>> >> Hello,
>> >>
>> >> we added about 7 new disks yesterday/today and our cluster became
>> >> very slow. While the rebalancing took place, 2 of the 7 newly added
>> >> disks died.
>> >>
>> >> Our cluster is still recovering, however we spotted that there are
>> >> a lot of unfound objects.
>> >>
>> >> We lost osd.63 and osd.64, which do not seem to be involved in the
>> >> sample pg that has unfound objects.
>> >>
>> >> We were wondering why there are unfound objects, where they are
>> >> coming from and if there is a way to recover them?
>> >>
>> >> Any help appreciated,
>> >>
>> >> Best,
>> >>
>> >> Nico
>> >>
>> >>
>> >> Our status is:
>> >>
>> >>   cluster:
>> >>     id:     26c0c5a8-d7ce-49ac-b5a7-bfd9d0ba81ab
>> >>     health: HEALTH_WARN
>> >>             261953/3006663 objects misplaced (8.712%)
>> >>             9377/1002221 objects unfound (0.936%)
>> >>             Reduced data availability: 176 pgs inactive
>> >>             Degraded data redundancy: 609338/3006663 objects degraded
>> >>             (20.266%), 243 pgs unclean, 222 pgs degraded, 213 pgs
>> >>             undersized
>> >>             mon server2 is low on available space
>> >>
>> >>   services:
>> >>     mon: 3 daemons, quorum server5,server3,server2
>> >>     mgr: server5(active), standbys: 2, server2, 0, server3
>> >>     osd: 54 osds: 54 up, 54 in; 234 remapped pgs
>> >>
>> >>   data:
>> >>     pools:   3 pools, 1344 pgs
>> >>     objects: 978k objects, 3823 GB
>> >>     usage:   9350 GB used, 40298 GB / 49648 GB avail
>> >>     pgs:     13.095% pgs not active
>> >>              609338/3006663 objects degraded (20.266%)
>> >>              261953/3006663 objects misplaced (8.712%)
>> >>              9377/1002221 objects unfound (0.936%)
>> >>              1101 active+clean
>> >>              84   recovery_wait+undersized+degraded+remapped+peered
>> >>              82   und
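With down PGs the first question is what they are waiting for; querying
one of them names the blockers. A sketch (the PG id is an example):

    # list all stuck/inactive PGs
    ceph pg dump_stuck inactive

    # in the recovery_state section, look for "peering_blocked_by" and
    # "down_osds_we_would_probe" to see which OSD is holding peering up
    ceph pg 4.2a query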
Re: [ceph-users] Adding disks -> getting unfound objects [Luminous]
I do remember seeing that exactly. As the number of recovery_wait pgs
decreased, the number of unfound objects decreased until they were all
found. Unfortunately it blocked some IO from happening during the
recovery, but in the long run we ended up with full data integrity again.

On Mon, Jan 22, 2018 at 1:03 PM Nico Schottelius <
nico.schottel...@ungleich.ch> wrote:

>
> Hey David,
>
> thanks for the fast answer. All our pools are running with size=3,
> min_size=2 and the two disks were in 2 different hosts.
>
> What I am a bit worried about is the output of "ceph pg 4.fa query" (see
> below) that indicates that ceph already queried all other hosts and did
> not find the data anywhere.
>
> Do you remember having seen something similar?
>
> Best,
>
> Nico
>
> David Turner writes:
>
> > I have had the same problem before with unfound objects that happened
> > while backfilling after losing a drive. We didn't lose drives outside
> > of the failure domains and ultimately didn't lose any data, but we did
> > have to wait until after all of the PGs in recovery_wait state were
> > caught up. So if the 2 disks you lost were in the same host and your
> > CRUSH rules are set so that you can lose a host without losing data,
> > then the cluster will likely find all of the objects by the time it's
> > done backfilling. With only losing 2 disks, I wouldn't worry about the
> > missing objects not becoming found unless you're pool size=2.
> >
> > On Mon, Jan 22, 2018 at 11:47 AM Nico Schottelius <
> > nico.schottel...@ungleich.ch> wrote:
> >
> >>
> >> Hello,
> >>
> >> we added about 7 new disks yesterday/today and our cluster became
> >> very slow. While the rebalancing took place, 2 of the 7 newly added
> >> disks died.
> >>
> >> Our cluster is still recovering, however we spotted that there are a
> >> lot of unfound objects.
> >>
> >> We lost osd.63 and osd.64, which do not seem to be involved in the
> >> sample pg that has unfound objects.
> >>
> >> We were wondering why there are unfound objects, where they are
> >> coming from and if there is a way to recover them?
> >>
> >> Any help appreciated,
> >>
> >> Best,
> >>
> >> Nico
> >>
> >>
> >> Our status is:
> >>
> >>   cluster:
> >>     id:     26c0c5a8-d7ce-49ac-b5a7-bfd9d0ba81ab
> >>     health: HEALTH_WARN
> >>             261953/3006663 objects misplaced (8.712%)
> >>             9377/1002221 objects unfound (0.936%)
> >>             Reduced data availability: 176 pgs inactive
> >>             Degraded data redundancy: 609338/3006663 objects degraded
> >>             (20.266%), 243 pgs unclean, 222 pgs degraded, 213 pgs
> >>             undersized
> >>             mon server2 is low on available space
> >>
> >>   services:
> >>     mon: 3 daemons, quorum server5,server3,server2
> >>     mgr: server5(active), standbys: 2, server2, 0, server3
> >>     osd: 54 osds: 54 up, 54 in; 234 remapped pgs
> >>
> >>   data:
> >>     pools:   3 pools, 1344 pgs
> >>     objects: 978k objects, 3823 GB
> >>     usage:   9350 GB used, 40298 GB / 49648 GB avail
> >>     pgs:     13.095% pgs not active
> >>              609338/3006663 objects degraded (20.266%)
> >>              261953/3006663 objects misplaced (8.712%)
> >>              9377/1002221 objects unfound (0.936%)
> >>              1101 active+clean
> >>              84   recovery_wait+undersized+degraded+remapped+peered
> >>              82   undersized+degraded+remapped+backfill_wait+peered
> >>              23   active+undersized+degraded+remapped+backfill_wait
> >>              18   active+remapped+backfill_wait
> >>              14   active+undersized+degraded+remapped+backfilling
> >>              10   undersized+degraded+remapped+backfilling+peered
> >>              9    active+recovery_wait+degraded
> >>              3    active+remapped+backfilling
> >>
> >>   io:
> >>     client:   624 kB/s rd, 3255 kB/s wr, 22 op/s rd, 66 op/s wr
> >>     recovery: 90148 kB/s, 22 objects/s
> >>
> >> Looking at the unfound objects:
> >>
> >> [17:32:17] server1:~# ceph health detail
> >> HEALTH_WARN 263745/3006663 objects misplaced (8.772%); 9377/1002221
> >> objects unfound (0.936%); Reduced data availability: 176 pgs inactive;
> >> Degraded data redundancy: 612398/3006663 objects degraded (20.368%),
> >> 244 pgs unclean, 223 pgs degraded, 214 pgs undersized; mon server2 is
> >> low on available space
> >> OBJECT_MISPLACED 263745/3006663 objects misplaced (8.772%)
> >> OBJECT_UNFOUND 9377/1002221 objects unfound (0.936%)
> >>     pg 4.fa has 117 unfound objects
> >>     pg 4.ff has 107 unfound objects
> >>     pg 4.fd has 113 unfound objects
> >>     pg 4.f0 has 120 unfound objects
> >>
> >> Output from ceph pg 4.fa query:
> >>
> >> {
> >>     "state": "recovery_wait+undersized+degraded+remapped+peered",
> >>     "snap_trimq": "[]",
> >>     "epoch": 17561,
> >>     "up": [8, 17, 25],
> >>     "acting": [61],
> >>     "backfill_targets": [
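If the unfound count is indeed expected to shrink as the recovery_wait
PGs are processed, that is easy to keep an eye on; a small sketch:

    # re-check every 30s: both numbers should fall as backfill catches up
    watch -n 30 "ceph -s | egrep 'unfound|recovery_wait'"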
Re: [ceph-users] Adding disks -> getting unfound objects [Luminous]
Hey David,

thanks for the fast answer. All our pools are running with size=3,
min_size=2 and the two disks were in 2 different hosts.

What I am a bit worried about is the output of "ceph pg 4.fa query" (see
below) that indicates that ceph already queried all other hosts and did
not find the data anywhere.

Do you remember having seen something similar?

Best,

Nico

David Turner writes:

> I have had the same problem before with unfound objects that happened
> while backfilling after losing a drive. We didn't lose drives outside of
> the failure domains and ultimately didn't lose any data, but we did have
> to wait until after all of the PGs in recovery_wait state were caught
> up. So if the 2 disks you lost were in the same host and your CRUSH
> rules are set so that you can lose a host without losing data, then the
> cluster will likely find all of the objects by the time it's done
> backfilling. With only losing 2 disks, I wouldn't worry about the
> missing objects not becoming found unless you're pool size=2.
>
> On Mon, Jan 22, 2018 at 11:47 AM Nico Schottelius <
> nico.schottel...@ungleich.ch> wrote:
>
>>
>> Hello,
>>
>> we added about 7 new disks yesterday/today and our cluster became very
>> slow. While the rebalancing took place, 2 of the 7 newly added disks
>> died.
>>
>> Our cluster is still recovering, however we spotted that there are a lot
>> of unfound objects.
>>
>> We lost osd.63 and osd.64, which do not seem to be involved in the
>> sample pg that has unfound objects.
>>
>> We were wondering why there are unfound objects, where they are coming
>> from and if there is a way to recover them?
>>
>> Any help appreciated,
>>
>> Best,
>>
>> Nico
>>
>>
>> Our status is:
>>
>>   cluster:
>>     id:     26c0c5a8-d7ce-49ac-b5a7-bfd9d0ba81ab
>>     health: HEALTH_WARN
>>             261953/3006663 objects misplaced (8.712%)
>>             9377/1002221 objects unfound (0.936%)
>>             Reduced data availability: 176 pgs inactive
>>             Degraded data redundancy: 609338/3006663 objects degraded
>>             (20.266%), 243 pgs unclean, 222 pgs degraded, 213 pgs
>>             undersized
>>             mon server2 is low on available space
>>
>>   services:
>>     mon: 3 daemons, quorum server5,server3,server2
>>     mgr: server5(active), standbys: 2, server2, 0, server3
>>     osd: 54 osds: 54 up, 54 in; 234 remapped pgs
>>
>>   data:
>>     pools:   3 pools, 1344 pgs
>>     objects: 978k objects, 3823 GB
>>     usage:   9350 GB used, 40298 GB / 49648 GB avail
>>     pgs:     13.095% pgs not active
>>              609338/3006663 objects degraded (20.266%)
>>              261953/3006663 objects misplaced (8.712%)
>>              9377/1002221 objects unfound (0.936%)
>>              1101 active+clean
>>              84   recovery_wait+undersized+degraded+remapped+peered
>>              82   undersized+degraded+remapped+backfill_wait+peered
>>              23   active+undersized+degraded+remapped+backfill_wait
>>              18   active+remapped+backfill_wait
>>              14   active+undersized+degraded+remapped+backfilling
>>              10   undersized+degraded+remapped+backfilling+peered
>>              9    active+recovery_wait+degraded
>>              3    active+remapped+backfilling
>>
>>   io:
>>     client:   624 kB/s rd, 3255 kB/s wr, 22 op/s rd, 66 op/s wr
>>     recovery: 90148 kB/s, 22 objects/s
>>
>> Looking at the unfound objects:
>>
>> [17:32:17] server1:~# ceph health detail
>> HEALTH_WARN 263745/3006663 objects misplaced (8.772%); 9377/1002221
>> objects unfound (0.936%); Reduced data availability: 176 pgs inactive;
>> Degraded data redundancy: 612398/3006663 objects degraded (20.368%),
>> 244 pgs unclean, 223 pgs degraded, 214 pgs undersized; mon server2 is
>> low on available space
>> OBJECT_MISPLACED 263745/3006663 objects misplaced (8.772%)
>> OBJECT_UNFOUND 9377/1002221 objects unfound (0.936%)
>>     pg 4.fa has 117 unfound objects
>>     pg 4.ff has 107 unfound objects
>>     pg 4.fd has 113 unfound objects
>>     pg 4.f0 has 120 unfound objects
>>
>> Output from ceph pg 4.fa query:
>>
>> {
>>     "state": "recovery_wait+undersized+degraded+remapped+peered",
>>     "snap_trimq": "[]",
>>     "epoch": 17561,
>>     "up": [8, 17, 25],
>>     "acting": [61],
>>     "backfill_targets": ["8", "17", "25"],
>>     "actingbackfill": ["8", "17", "25", "61"],
>>     "info": {
>>         "pgid": "4.fa",
>>         "last_update": "17529'85051",
>>         "last_complete": "17217'77468",
>>         "log_tail": "17091'75034",
>>         "last_user_version": 85051,
>>         "last_backfill": "MAX",
>>         "last_backfill_bitwise": 0,
>>         "purged_snaps": [
>>             {"start": "1", "length": "3"},
>>             {"start": "6",
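The part of the query that records that probing is the might_have_unfound
list in recovery_state; a sketch for extracting just that (assuming jq is
available):

    # per probed OSD: "already probed", "osd is down", "querying", ...
    ceph pg 4.fa query | jq '.recovery_state[] | .might_have_unfound? // empty'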
Re: [ceph-users] Adding disks -> getting unfound objects [Luminous]
I have had the same problem before with unfound objects that happened
while backfilling after losing a drive. We didn't lose drives outside of
the failure domains and ultimately didn't lose any data, but we did have
to wait until after all of the PGs in recovery_wait state were caught up.
So if the 2 disks you lost were in the same host and your CRUSH rules are
set so that you can lose a host without losing data, then the cluster
will likely find all of the objects by the time it's done backfilling.
With only losing 2 disks, I wouldn't worry about the missing objects not
becoming found unless you're pool size=2.

On Mon, Jan 22, 2018 at 11:47 AM Nico Schottelius <
nico.schottel...@ungleich.ch> wrote:

>
> Hello,
>
> we added about 7 new disks yesterday/today and our cluster became very
> slow. While the rebalancing took place, 2 of the 7 newly added disks
> died.
>
> Our cluster is still recovering, however we spotted that there are a lot
> of unfound objects.
>
> We lost osd.63 and osd.64, which do not seem to be involved in the
> sample pg that has unfound objects.
>
> We were wondering why there are unfound objects, where they are coming
> from and if there is a way to recover them?
>
> Any help appreciated,
>
> Best,
>
> Nico
>
>
> Our status is:
>
>   cluster:
>     id:     26c0c5a8-d7ce-49ac-b5a7-bfd9d0ba81ab
>     health: HEALTH_WARN
>             261953/3006663 objects misplaced (8.712%)
>             9377/1002221 objects unfound (0.936%)
>             Reduced data availability: 176 pgs inactive
>             Degraded data redundancy: 609338/3006663 objects degraded
>             (20.266%), 243 pgs unclean, 222 pgs degraded, 213 pgs
>             undersized
>             mon server2 is low on available space
>
>   services:
>     mon: 3 daemons, quorum server5,server3,server2
>     mgr: server5(active), standbys: 2, server2, 0, server3
>     osd: 54 osds: 54 up, 54 in; 234 remapped pgs
>
>   data:
>     pools:   3 pools, 1344 pgs
>     objects: 978k objects, 3823 GB
>     usage:   9350 GB used, 40298 GB / 49648 GB avail
>     pgs:     13.095% pgs not active
>              609338/3006663 objects degraded (20.266%)
>              261953/3006663 objects misplaced (8.712%)
>              9377/1002221 objects unfound (0.936%)
>              1101 active+clean
>              84   recovery_wait+undersized+degraded+remapped+peered
>              82   undersized+degraded+remapped+backfill_wait+peered
>              23   active+undersized+degraded+remapped+backfill_wait
>              18   active+remapped+backfill_wait
>              14   active+undersized+degraded+remapped+backfilling
>              10   undersized+degraded+remapped+backfilling+peered
>              9    active+recovery_wait+degraded
>              3    active+remapped+backfilling
>
>   io:
>     client:   624 kB/s rd, 3255 kB/s wr, 22 op/s rd, 66 op/s wr
>     recovery: 90148 kB/s, 22 objects/s
>
> Looking at the unfound objects:
>
> [17:32:17] server1:~# ceph health detail
> HEALTH_WARN 263745/3006663 objects misplaced (8.772%); 9377/1002221
> objects unfound (0.936%); Reduced data availability: 176 pgs inactive;
> Degraded data redundancy: 612398/3006663 objects degraded (20.368%), 244
> pgs unclean, 223 pgs degraded, 214 pgs undersized; mon server2 is low on
> available space
> OBJECT_MISPLACED 263745/3006663 objects misplaced (8.772%)
> OBJECT_UNFOUND 9377/1002221 objects unfound (0.936%)
>     pg 4.fa has 117 unfound objects
>     pg 4.ff has 107 unfound objects
>     pg 4.fd has 113 unfound objects
>     pg 4.f0 has 120 unfound objects
>
> Output from ceph pg 4.fa query:
>
> {
>     "state": "recovery_wait+undersized+degraded+remapped+peered",
>     "snap_trimq": "[]",
>     "epoch": 17561,
>     "up": [8, 17, 25],
>     "acting": [61],
>     "backfill_targets": ["8", "17", "25"],
>     "actingbackfill": ["8", "17", "25", "61"],
>     "info": {
>         "pgid": "4.fa",
>         "last_update": "17529'85051",
>         "last_complete": "17217'77468",
>         "log_tail": "17091'75034",
>         "last_user_version": 85051,
>         "last_backfill": "MAX",
>         "last_backfill_bitwise": 0,
>         "purged_snaps": [
>             {"start": "1", "length": "3"},
>             {"start": "6", "length": "8"},
>             {"start": "10", "length": "2"}
>         ],
>         "history": {
>             "epoch_created": 9134,
>             "epoch_pool_created": 9134,
>             "last_epoch_started": 17528,
>             "last_interval_started": 17527,
>             "last_epoch_clean": 17079,
>             "last_interval_clean": 17078,
>             "last_epoch_split": 0,
>             "last_epoch_marked_full": 0,
>             "same_up_since": 17143,
[ceph-users] Adding disks -> getting unfound objects [Luminous]
Hello,

we added about 7 new disks yesterday/today and our cluster became very
slow. While the rebalancing took place, 2 of the 7 newly added disks
died.

Our cluster is still recovering, however we spotted that there are a lot
of unfound objects.

We lost osd.63 and osd.64, which do not seem to be involved in the sample
pg that has unfound objects.

We were wondering why there are unfound objects, where they are coming
from and if there is a way to recover them?

Any help appreciated,

Best,

Nico


Our status is:

  cluster:
    id:     26c0c5a8-d7ce-49ac-b5a7-bfd9d0ba81ab
    health: HEALTH_WARN
            261953/3006663 objects misplaced (8.712%)
            9377/1002221 objects unfound (0.936%)
            Reduced data availability: 176 pgs inactive
            Degraded data redundancy: 609338/3006663 objects degraded
            (20.266%), 243 pgs unclean, 222 pgs degraded, 213 pgs
            undersized
            mon server2 is low on available space

  services:
    mon: 3 daemons, quorum server5,server3,server2
    mgr: server5(active), standbys: 2, server2, 0, server3
    osd: 54 osds: 54 up, 54 in; 234 remapped pgs

  data:
    pools:   3 pools, 1344 pgs
    objects: 978k objects, 3823 GB
    usage:   9350 GB used, 40298 GB / 49648 GB avail
    pgs:     13.095% pgs not active
             609338/3006663 objects degraded (20.266%)
             261953/3006663 objects misplaced (8.712%)
             9377/1002221 objects unfound (0.936%)
             1101 active+clean
             84   recovery_wait+undersized+degraded+remapped+peered
             82   undersized+degraded+remapped+backfill_wait+peered
             23   active+undersized+degraded+remapped+backfill_wait
             18   active+remapped+backfill_wait
             14   active+undersized+degraded+remapped+backfilling
             10   undersized+degraded+remapped+backfilling+peered
             9    active+recovery_wait+degraded
             3    active+remapped+backfilling

  io:
    client:   624 kB/s rd, 3255 kB/s wr, 22 op/s rd, 66 op/s wr
    recovery: 90148 kB/s, 22 objects/s

Looking at the unfound objects:

[17:32:17] server1:~# ceph health detail
HEALTH_WARN 263745/3006663 objects misplaced (8.772%); 9377/1002221
objects unfound (0.936%); Reduced data availability: 176 pgs inactive;
Degraded data redundancy: 612398/3006663 objects degraded (20.368%), 244
pgs unclean, 223 pgs degraded, 214 pgs undersized; mon server2 is low on
available space
OBJECT_MISPLACED 263745/3006663 objects misplaced (8.772%)
OBJECT_UNFOUND 9377/1002221 objects unfound (0.936%)
    pg 4.fa has 117 unfound objects
    pg 4.ff has 107 unfound objects
    pg 4.fd has 113 unfound objects
    pg 4.f0 has 120 unfound objects

Output from ceph pg 4.fa query:

{
    "state": "recovery_wait+undersized+degraded+remapped+peered",
    "snap_trimq": "[]",
    "epoch": 17561,
    "up": [8, 17, 25],
    "acting": [61],
    "backfill_targets": ["8", "17", "25"],
    "actingbackfill": ["8", "17", "25", "61"],
    "info": {
        "pgid": "4.fa",
        "last_update": "17529'85051",
        "last_complete": "17217'77468",
        "log_tail": "17091'75034",
        "last_user_version": 85051,
        "last_backfill": "MAX",
        "last_backfill_bitwise": 0,
        "purged_snaps": [
            {"start": "1", "length": "3"},
            {"start": "6", "length": "8"},
            {"start": "10", "length": "2"}
        ],
        "history": {
            "epoch_created": 9134,
            "epoch_pool_created": 9134,
            "last_epoch_started": 17528,
            "last_interval_started": 17527,
            "last_epoch_clean": 17079,
            "last_interval_clean": 17078,
            "last_epoch_split": 0,
            "last_epoch_marked_full": 0,
            "same_up_since": 17143,
            "same_interval_since": 17530,
            "same_primary_since": 17515,
            "last_scrub": "17090'57357",
            "last_scrub_stamp": "2018-01-20 20:45:32.616142",
            "last_deep_scrub": "17082'54734",
            "last_deep_scrub_stamp": "2018-01-15 21:09:34.121488",
            "last_clean_scrub_stamp": "2018-01-20 20:45:32.616142"
        },
        "stats": {
            "version": "17529'85051",
            "reported_seq": "218453",
            "reported_epoch": "17561",
            "state": "recovery_wait+undersized+degraded+remapped+peered",
            "last_fresh": "2018-01-22 17:42:28.196701",
            "last_change": "2018-01-22 15:00:46.507189",
            "last_active": "2018-01-22 15:00:44.635399",
            "last_peered": "2018-01-22 17:42:28.196701",
            "last_clean": "2018-01-21 20:15:48.267209",
            "last_became_active": "2018-01-22 14:53:07.918893",
            "last_became_peered