Re: [ceph-users] Adding disks -> getting unfound objects [Luminous]

2018-01-23 Thread Nico Schottelius

Hey Burkhard,

we did actually restart osd.61, which led to the current status.

Best,

Nico


Burkhard Linke writes:
> On 01/23/2018 08:54 AM, Nico Schottelius wrote:
>> Good morning,
>>
>> the osd.61 actually just crashed and the disk is still intact. However,
>> after 8 hours of rebuilding, the unfound objects are still missing:
>
> *snipsnap*
>>
>>
>> Is there any chance to recover those pgs or did we actually lose data
>> with a 2 disk failure?
>>
>> And is there any way out of this besides going with
>>
>>  ceph pg {pg-id} mark_unfound_lost revert|delete
>>
>> ?
>
> Just my 2 cents:
>
> If the disk is still intact and the data is still readable, you can try
> to export the pg content with ceph-objectstore-tool, and import it into
> another OSD.
>
> On the other hand: if the disk is still intact, just restart the OSD?

--
Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch


Re: [ceph-users] Adding disks -> getting unfound objects [Luminous]

2018-01-23 Thread Burkhard Linke

Hi,


On 01/23/2018 08:54 AM, Nico Schottelius wrote:

Good morning,

the osd.61 actually just crashed and the disk is still intact. However,
after 8 hours of rebuilding, the unfound objects are still missing:


*snipsnap*



Is there any chance to recover those pgs or did we actually lose data
with a 2 disk failure?

And is there any way out of this besides going with

 ceph pg {pg-id} mark_unfound_lost revert|delete

?


Just my 2 cents:

If the disk is still intact and the data is still readable, you can try 
to export the pg content with ceph-objectstore-tool, and import it into 
another OSD.
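
Roughly, and only as a sketch (the OSD ids, the pg id 4.2a and the export path below are placeholders, and both OSDs must be stopped while ceph-objectstore-tool runs against them; FileStore OSDs may additionally need --journal-path):

    # on the host with the intact disk: stop the OSD, then export the pg
    systemctl stop ceph-osd@61
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-61 \
        --pgid 4.2a --op export --file /tmp/pg-4.2a.export

    # on the host with a healthy target OSD (also stopped): import and restart
    systemctl stop ceph-osd@17
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-17 \
        --op import --file /tmp/pg-4.2a.export
    systemctl start ceph-osd@17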


On the other hand: if the disk is still intact, just restart the OSD?

Regards,
Burkhard


Re: [ceph-users] Adding disks -> getting unfound objects [Luminous]

2018-01-23 Thread Nico Schottelius

... while trying to locate which VMs are potentially affected by a
revert/delete, we noticed that

root@server1:~# rados -p one-hdd ls

hangs. Where does ceph store the index of block devices found in a pool?
And is it possible that this information is in one of the damaged pgs?
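
A possible way to narrow that down, sketched and untested (pool one-hdd and pg 4.fa are taken from above, <image-name> is a placeholder):

    # list the missing/unfound objects of one affected pg (Luminous syntax)
    ceph pg 4.fa list_missing

    # for format-2 RBD images the image index lives in the omap of the
    # pool's rbd_directory object; a listing can hang if the pg holding
    # that object is not active
    rados -p one-hdd listomapvals rbd_directory

    # match the rbd_data.<prefix> part of unfound object names against
    # each image's block_name_prefix to find the affected VM disks
    rbd -p one-hdd info <image-name> | grep block_name_prefix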

Nico


Nico Schottelius  writes:

> Good morning,
>
> the osd.61 actually just crashed and the disk is still intact. However,
> after 8 hours of rebuilding, the unfound objects are still missing:
>
> root@server1:~# ceph -s
>   cluster:
> id: 26c0c5a8-d7ce-49ac-b5a7-bfd9d0ba81ab
> health: HEALTH_WARN
> noscrub,nodeep-scrub flag(s) set
> 111436/3017766 objects misplaced (3.693%)
> 9377/1005922 objects unfound (0.932%)
> Reduced data availability: 84 pgs inactive
> Degraded data redundancy: 277034/3017766 objects degraded (9.180%), 84 pgs unclean, 84 pgs degraded, 84 pgs undersized
> mon server2 is low on available space
>
>   services:
> mon: 3 daemons, quorum server5,server3,server2
> mgr: server5(active), standbys: server2, 2, 0, server3
> osd: 54 osds: 54 up, 54 in; 84 remapped pgs
>  flags noscrub,nodeep-scrub
>
>   data:
> pools:   3 pools, 1344 pgs
> objects: 982k objects, 3837 GB
> usage:   10618 GB used, 39030 GB / 49648 GB avail
> pgs: 6.250% pgs not active
>  277034/3017766 objects degraded (9.180%)
>  111436/3017766 objects misplaced (3.693%)
>  9377/1005922 objects unfound (0.932%)
>  1260 active+clean
>  84   recovery_wait+undersized+degraded+remapped+peered
>
>   io:
> client:   68960 B/s rd, 20722 kB/s wr, 12 op/s rd, 77 op/s wr
>
> We tried restarting osd.61, but ceph health detail does not change
> anymore:
>
> HEALTH_WARN noscrub,nodeep-scrub flag(s) set; 111436/3017886 objects misplaced (3.693%); 9377/1005962 objects unfound (0.932%); Reduced data availability: 84 pgs inactive; Degraded data redundancy: 277034/3017886 objects degraded (9.180%), 84 pgs unclean, 84 pgs degraded, 84 pgs undersized; mon server2 is low on available space
> OSDMAP_FLAGS noscrub,nodeep-scrub flag(s) set
> OBJECT_MISPLACED 111436/3017886 objects misplaced (3.693%)
> OBJECT_UNFOUND 9377/1005962 objects unfound (0.932%)
> pg 4.fa has 117 unfound objects
> pg 4.ff has 107 unfound objects
> pg 4.fd has 113 unfound objects
> ...
> pg 4.2a has 108 unfound objects
>
> PG_AVAILABILITY Reduced data availability: 84 pgs inactive
> pg 4.2a is stuck inactive for 64117.189552, current state recovery_wait+undersized+degraded+remapped+peered, last acting [61]
> pg 4.31 is stuck inactive for 64117.147636, current state recovery_wait+undersized+degraded+remapped+peered, last acting [61]
> pg 4.32 is stuck inactive for 64117.178461, current state recovery_wait+undersized+degraded+remapped+peered, last acting [61]
> pg 4.34 is stuck inactive for 64117.150475, current state recovery_wait+undersized+degraded+remapped+peered, last acting [61]
> ...
>
>
> PG_DEGRADED Degraded data redundancy: 277034/3017886 objects degraded (9.180%), 84 pgs unclean, 84 pgs degraded, 84 pgs undersized
> pg 4.2a is stuck unclean for 131612.984555, current state recovery_wait+undersized+degraded+remapped+peered, last acting [61]
> pg 4.31 is stuck undersized for 221.568468, current state recovery_wait+undersized+degraded+remapped+peered, last acting [61]
>
>
> Is there any chance to recover those pgs or did we actually lose data
> with a 2 disk failure?
>
> And is there any way out of this besides going with
>
> ceph pg {pg-id} mark_unfound_lost revert|delete
>
> ?
>
> Best,
>
> Nico
>
> p.s.: the ceph 4.2a query:
>
> {
> "state": "recovery_wait+undersized+degraded+remapped+peered",
> "snap_trimq": "[]",
> "epoch": 17879,
> "up": [
> 17,
> 13,
> 25
> ],
> "acting": [
> 61
> ],
> "backfill_targets": [
> "13",
> "17",
> "25"
> ],
> "actingbackfill": [
> "13",
> "17",
> "25",
> "61"
> ],
> "info": {
> "pgid": "4.2a",
> "last_update": "17529'53875",
> "last_complete": "17217'45447",
> "log_tail": "17090'43812",
> "last_user_version": 53875,
> "last_backfill": "MAX",
> "last_backfill_bitwise": 0,
> "purged_snaps": [
> {
> "start": "1",
> "length": "3"
> },
> {
> "start": "6",
> "length": "8"
> },
> {
> "start": "10",
> "length": "2"
> }
> ],
> "history": {
> "epoch_created": 9134,
> "epoch_pool_created": 9134,
> "last_epoch_started": 17528,
> "last_interval_started": 17527,
>   

Re: [ceph-users] Adding disks -> getting unfound objects [Luminous]

2018-01-22 Thread Nico Schottelius

Good morning,

the osd.61 actually just crashed and the disk is still intact. However,
after 8 hours of rebuilding, the unfound objects are still missing:

root@server1:~# ceph -s
  cluster:
id: 26c0c5a8-d7ce-49ac-b5a7-bfd9d0ba81ab
health: HEALTH_WARN
noscrub,nodeep-scrub flag(s) set
111436/3017766 objects misplaced (3.693%)
9377/1005922 objects unfound (0.932%)
Reduced data availability: 84 pgs inactive
                Degraded data redundancy: 277034/3017766 objects degraded (9.180%), 84 pgs unclean, 84 pgs degraded, 84 pgs undersized
mon server2 is low on available space

  services:
mon: 3 daemons, quorum server5,server3,server2
mgr: server5(active), standbys: server2, 2, 0, server3
osd: 54 osds: 54 up, 54 in; 84 remapped pgs
 flags noscrub,nodeep-scrub

  data:
pools:   3 pools, 1344 pgs
objects: 982k objects, 3837 GB
usage:   10618 GB used, 39030 GB / 49648 GB avail
pgs: 6.250% pgs not active
 277034/3017766 objects degraded (9.180%)
 111436/3017766 objects misplaced (3.693%)
 9377/1005922 objects unfound (0.932%)
 1260 active+clean
 84   recovery_wait+undersized+degraded+remapped+peered

  io:
client:   68960 B/s rd, 20722 kB/s wr, 12 op/s rd, 77 op/s wr

We tried restarting osd.61, but ceph health detail does not change
anymore:

HEALTH_WARN noscrub,nodeep-scrub flag(s) set; 111436/3017886 objects misplaced (3.693%); 9377/1005962 objects unfound (0.932%); Reduced data availability: 84 pgs inactive; Degraded data redundancy: 277034/3017886 objects degraded (9.180%), 84 pgs unclean, 84 pgs degraded, 84 pgs undersized; mon server2 is low on available space
OSDMAP_FLAGS noscrub,nodeep-scrub flag(s) set
OBJECT_MISPLACED 111436/3017886 objects misplaced (3.693%)
OBJECT_UNFOUND 9377/1005962 objects unfound (0.932%)
pg 4.fa has 117 unfound objects
pg 4.ff has 107 unfound objects
pg 4.fd has 113 unfound objects
...
pg 4.2a has 108 unfound objects

PG_AVAILABILITY Reduced data availability: 84 pgs inactive
pg 4.2a is stuck inactive for 64117.189552, current state recovery_wait+undersized+degraded+remapped+peered, last acting [61]
pg 4.31 is stuck inactive for 64117.147636, current state recovery_wait+undersized+degraded+remapped+peered, last acting [61]
pg 4.32 is stuck inactive for 64117.178461, current state recovery_wait+undersized+degraded+remapped+peered, last acting [61]
pg 4.34 is stuck inactive for 64117.150475, current state recovery_wait+undersized+degraded+remapped+peered, last acting [61]
...


PG_DEGRADED Degraded data redundancy: 277034/3017886 objects degraded (9.180%), 84 pgs unclean, 84 pgs degraded, 84 pgs undersized
pg 4.2a is stuck unclean for 131612.984555, current state recovery_wait+undersized+degraded+remapped+peered, last acting [61]
pg 4.31 is stuck undersized for 221.568468, current state recovery_wait+undersized+degraded+remapped+peered, last acting [61]


Is there any chance to recover those pgs or did we actually lose data
with a 2 disk failure?

And is there any way out of this besides going with

ceph pg {pg-id} mark_unfound_lost revert|delete

?
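
Before going that route it may be worth checking, per pg, what exactly is unfound and which OSDs were already probed; a rough sketch using pg 4.2a as an example:

    # list the unfound objects and the OSDs that were queried for them
    ceph pg 4.2a list_missing
    ceph pg 4.2a query | grep -A5 might_have_unfound

    # last resort only: revert unfound objects to their previous version
    # (or forget them entirely with "delete")
    ceph pg 4.2a mark_unfound_lost revert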

Best,

Nico

p.s.: the ceph 4.2a query:

{
"state": "recovery_wait+undersized+degraded+remapped+peered",
"snap_trimq": "[]",
"epoch": 17879,
"up": [
17,
13,
25
],
"acting": [
61
],
"backfill_targets": [
"13",
"17",
"25"
],
"actingbackfill": [
"13",
"17",
"25",
"61"
],
"info": {
"pgid": "4.2a",
"last_update": "17529'53875",
"last_complete": "17217'45447",
"log_tail": "17090'43812",
"last_user_version": 53875,
"last_backfill": "MAX",
"last_backfill_bitwise": 0,
"purged_snaps": [
{
"start": "1",
"length": "3"
},
{
"start": "6",
"length": "8"
},
{
"start": "10",
"length": "2"
}
],
"history": {
"epoch_created": 9134,
"epoch_pool_created": 9134,
"last_epoch_started": 17528,
"last_interval_started": 17527,
"last_epoch_clean": 17079,
"last_interval_clean": 17078,
"last_epoch_split": 0,
"last_epoch_marked_full": 0,
"same_up_since": 17143,
"same_interval_since": 17878,
"same_primary_since": 17878,
"last_scrub": "17090'44622",
"last_scrub_stamp": "2018-01-21 09:37:09.888508",
"last_deep_scrub": "17090'42219",
"last_deep_scrub_stamp": "2018-01-20 05:05:45.372052",
"last_clean_scrub_stamp": "2018-01-21 09:37:09.888508"
},
"stats": {
 

Re: [ceph-users] Adding disks -> getting unfound objects [Luminous]

2018-01-22 Thread David Turner
Weight the remaining disks you added to 0.0; they seem to be a bad batch.
This will start moving their data off of them and back onto the rest of the
cluster. I generally suggest not adding more storage than you can afford to
lose, unless you trust your burn-in process. So if you have a host failure
domain and size=3, I wouldn't add storage to more than 2 nodes at a time in
case the disks die. That way you are much less likely to have scares.

I assume this disk was in a third node, leaving you with 3 failed disks
across 3 hosts? It doesn't seem like these drives are going to work out.
I would immediately weight all newly added disks to 0.0, get back to a
point where you are no longer backfilling/recovering PGs, and see where
things stand from there.
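
A sketch of that (the osd ids below are placeholders for whichever of the newly added disks are still in the cluster):

    # set the crush weight of the remaining new disks to 0 so their data
    # moves back onto the pre-existing OSDs
    for id in 60 62 65 66; do
        ceph osd crush reweight osd.$id 0.0
    done

    # then watch backfill/recovery drain down
    ceph -s
    ceph pg dump pgs_brief | grep -v active+clean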

On Mon, Jan 22, 2018 at 1:33 PM Nico Schottelius <
nico.schottel...@ungleich.ch> wrote:

>
> While writing, yet another disk (osd.61 now) died and now we have
> 172 pgs down:
>
> [19:32:35] server2:~# ceph -s
>   cluster:
> id: 26c0c5a8-d7ce-49ac-b5a7-bfd9d0ba81ab
> health: HEALTH_WARN
> noscrub,nodeep-scrub flag(s) set
> 21033/2263701 objects misplaced (0.929%)
> Reduced data availability: 186 pgs inactive, 172 pgs down
> Degraded data redundancy: 67370/2263701 objects degraded (2.976%), 219 pgs unclean, 46 pgs degraded, 46 pgs undersized
> mon server2 is low on available space
>
>   services:
> mon: 3 daemons, quorum server5,server3,server2
> mgr: server5(active), standbys: server2, 2, 0, server3
> osd: 54 osds: 53 up, 53 in; 47 remapped pgs
>  flags noscrub,nodeep-scrub
>
>   data:
> pools:   3 pools, 1344 pgs
> objects: 736k objects, 2889 GB
> usage:   8517 GB used, 36474 GB / 44991 GB avail
> pgs: 13.839% pgs not active
>  67370/2263701 objects degraded (2.976%)
>  21033/2263701 objects misplaced (0.929%)
>  1125 active+clean
>  172  down
>  26   active+undersized+degraded+remapped+backfilling
>  14   undersized+degraded+remapped+backfilling+peered
>  6active+undersized+degraded+remapped+backfill_wait
>  1active+remapped+backfill_wait
>
>   io:
> client:   835 kB/s rd, 262 kB/s wr, 16 op/s rd, 25 op/s wr
> recovery: 102 MB/s, 26 objects/s
>
> What is the most sensible way to get out of this situation?
>
>
>
>
>
> David Turner  writes:
>
> > I do remember seeing that exactly. As the number of recovery_wait pgs
> > decreased, the number of unfound objects decreased until they were all
> > found.  Unfortunately it blocked some IO from happening during the
> > recovery, but in the long run we ended up with full data integrity again.
> >
> > On Mon, Jan 22, 2018 at 1:03 PM Nico Schottelius <
> > nico.schottel...@ungleich.ch> wrote:
> >
> >>
> >> Hey David,
> >>
> >> thanks for the fast answer. All our pools are running with size=3,
> >> min_size=2 and the two disks were in 2 different hosts.
> >>
> >> What I am a bit worried about is the output of "ceph pg 4.fa query" (see
> >> below) that indicates that ceph already queried all other hosts and did
> >> not find the data anywhere.
> >>
> >> Do you remember having seen something similar?
> >>
> >> Best,
> >>
> >> Nico
> >>
> >> David Turner  writes:
> >>
> >> > I have had the same problem before with unfound objects that happened
> >> while
> >> > backfilling after losing a drive. We didn't lose drives outside of the
> >> > failure domains and ultimately didn't lose any data, but we did have
> to
> >> > wait until after all of the PGs in recovery_wait state were caught up.
> >> So
> >> > if the 2 disks you lost were in the same host and your CRUSH rules are
> >> set
> >> > so that you can lose a host without losing data, then the cluster will
> >> > likely find all of the objects by the time it's done backfilling.
> With
> >> > only losing 2 disks, I wouldn't worry about the missing objects not
> >> > becoming found unless your pool is size=2.
> >> >
> >> > On Mon, Jan 22, 2018 at 11:47 AM Nico Schottelius <
> >> > nico.schottel...@ungleich.ch> wrote:
> >> >
> >> >>
> >> >> Hello,
> >> >>
> >> >> we added about 7 new disks yesterday/today and our cluster became
> very
> >> >> slow. While the rebalancing took place, 2 of the 7 new added disks
> >> >> died.
> >> >>
> >> >> Our cluster is still recovering, however we spotted that there are a
> lot
> >> >> of unfound objects.
> >> >>
> >> >> We lost osd.63 and osd.64, which seem not to be involved into the
> sample
> >> >> pg that has unfound objects.
> >> >>
> >> >> We were wondering why there are unfound objects, where they are
> coming
> >> >> from and if there is a way to recover them?
> >> >>
> >> >> Any help appreciated,
> >> >>
> >> >> Best,
> >> >>
> >> >> Nico
> >> >>
> >> >>
> >> >> Our status is:
> >> >>
> >> >>   cluster:
> >> >> id: 26c0c5a8-d7ce-49ac-b5a7-bfd9d0ba81ab
> >> >> health: HEALTH_WARN
> >>

Re: [ceph-users] Adding disks -> getting unfound objects [Luminous]

2018-01-22 Thread Nico Schottelius

While writing, yet another disk (osd.61 now) died and now we have
172 pgs down:

[19:32:35] server2:~# ceph -s
  cluster:
id: 26c0c5a8-d7ce-49ac-b5a7-bfd9d0ba81ab
health: HEALTH_WARN
noscrub,nodeep-scrub flag(s) set
21033/2263701 objects misplaced (0.929%)
Reduced data availability: 186 pgs inactive, 172 pgs down
                Degraded data redundancy: 67370/2263701 objects degraded (2.976%), 219 pgs unclean, 46 pgs degraded, 46 pgs undersized
mon server2 is low on available space

  services:
mon: 3 daemons, quorum server5,server3,server2
mgr: server5(active), standbys: server2, 2, 0, server3
osd: 54 osds: 53 up, 53 in; 47 remapped pgs
 flags noscrub,nodeep-scrub

  data:
pools:   3 pools, 1344 pgs
objects: 736k objects, 2889 GB
usage:   8517 GB used, 36474 GB / 44991 GB avail
pgs: 13.839% pgs not active
 67370/2263701 objects degraded (2.976%)
 21033/2263701 objects misplaced (0.929%)
 1125 active+clean
 172  down
 26   active+undersized+degraded+remapped+backfilling
 14   undersized+degraded+remapped+backfilling+peered
 6active+undersized+degraded+remapped+backfill_wait
 1active+remapped+backfill_wait

  io:
client:   835 kB/s rd, 262 kB/s wr, 16 op/s rd, 25 op/s wr
recovery: 102 MB/s, 26 objects/s

What is the most sensible way to get out of this situation?





David Turner  writes:

> I do remember seeing that exactly. As the number of recovery_wait pgs
> decreased, the number of unfound objects decreased until they were all
> found.  Unfortunately it blocked some IO from happening during the
> recovery, but in the long run we ended up with full data integrity again.
>
> On Mon, Jan 22, 2018 at 1:03 PM Nico Schottelius <
> nico.schottel...@ungleich.ch> wrote:
>
>>
>> Hey David,
>>
>> thanks for the fast answer. All our pools are running with size=3,
>> min_size=2 and the two disks were in 2 different hosts.
>>
>> What I am a bit worried about is the output of "ceph pg 4.fa query" (see
>> below) that indicates that ceph already queried all other hosts and did
>> not find the data anywhere.
>>
>> Do you remember having seen something similar?
>>
>> Best,
>>
>> Nico
>>
>> David Turner  writes:
>>
>> > I have had the same problem before with unfound objects that happened
>> while
>> > backfilling after losing a drive. We didn't lose drives outside of the
>> > failure domains and ultimately didn't lose any data, but we did have to
>> > wait until after all of the PGs in recovery_wait state were caught up.
>> So
>> > if the 2 disks you lost were in the same host and your CRUSH rules are
>> set
>> > so that you can lose a host without losing data, then the cluster will
>> > likely find all of the objects by the time it's done backfilling.  With
>> > only losing 2 disks, I wouldn't worry about the missing objects not
>> > becoming found unless your pool is size=2.
>> >
>> > On Mon, Jan 22, 2018 at 11:47 AM Nico Schottelius <
>> > nico.schottel...@ungleich.ch> wrote:
>> >
>> >>
>> >> Hello,
>> >>
>> >> we added about 7 new disks yesterday/today and our cluster became very
>> >> slow. While the rebalancing took place, 2 of the 7 new added disks
>> >> died.
>> >>
>> >> Our cluster is still recovering, however we spotted that there are a lot
>> >> of unfound objects.
>> >>
>> >> We lost osd.63 and osd.64, which seem not to be involved into the sample
>> >> pg that has unfound objects.
>> >>
>> >> We were wondering why there are unfound objects, where they are coming
>> >> from and if there is a way to recover them?
>> >>
>> >> Any help appreciated,
>> >>
>> >> Best,
>> >>
>> >> Nico
>> >>
>> >>
>> >> Our status is:
>> >>
>> >>   cluster:
>> >> id: 26c0c5a8-d7ce-49ac-b5a7-bfd9d0ba81ab
>> >> health: HEALTH_WARN
>> >> 261953/3006663 objects misplaced (8.712%)
>> >> 9377/1002221 objects unfound (0.936%)
>> >> Reduced data availability: 176 pgs inactive
> >> Degraded data redundancy: 609338/3006663 objects degraded (20.266%), 243 pgs unclean, 222 pgs degraded, 213 pgs undersized
>> >> mon server2 is low on available space
>> >>
>> >>   services:
>> >> mon: 3 daemons, quorum server5,server3,server2
>> >> mgr: server5(active), standbys: 2, server2, 0, server3
>> >> osd: 54 osds: 54 up, 54 in; 234 remapped pgs
>> >>
>> >>   data:
>> >> pools:   3 pools, 1344 pgs
>> >> objects: 978k objects, 3823 GB
>> >> usage:   9350 GB used, 40298 GB / 49648 GB avail
>> >> pgs: 13.095% pgs not active
>> >>  609338/3006663 objects degraded (20.266%)
>> >>  261953/3006663 objects misplaced (8.712%)
>> >>  9377/1002221 objects unfound (0.936%)
>> >>  1101 active+clean
>> >>  84   recovery_wait+undersized+degraded+remapped+peered
>> >>  82   und

Re: [ceph-users] Adding disks -> getting unfound objects [Luminous]

2018-01-22 Thread David Turner
I do remember seeing that exactly. As the number of recovery_wait pgs
decreased, the number of unfound objects decreased until they were all
found.  Unfortunately it blocked some IO from happening during the
recovery, but in the long run we ended up with full data integrity again.

On Mon, Jan 22, 2018 at 1:03 PM Nico Schottelius <
nico.schottel...@ungleich.ch> wrote:

>
> Hey David,
>
> thanks for the fast answer. All our pools are running with size=3,
> min_size=2 and the two disks were in 2 different hosts.
>
> What I am a bit worried about is the output of "ceph pg 4.fa query" (see
> below) that indicates that ceph already queried all other hosts and did
> not find the data anywhere.
>
> Do you remember having seen something similar?
>
> Best,
>
> Nico
>
> David Turner  writes:
>
> > I have had the same problem before with unfound objects that happened
> while
> > backfilling after losing a drive. We didn't lose drives outside of the
> > failure domains and ultimately didn't lose any data, but we did have to
> > wait until after all of the PGs in recovery_wait state were caught up.
> So
> > if the 2 disks you lost were in the same host and your CRUSH rules are
> set
> > so that you can lose a host without losing data, then the cluster will
> > likely find all of the objects by the time it's done backfilling.  With
> > only losing 2 disks, I wouldn't worry about the missing objects not
> > becoming found unless your pool is size=2.
> >
> > On Mon, Jan 22, 2018 at 11:47 AM Nico Schottelius <
> > nico.schottel...@ungleich.ch> wrote:
> >
> >>
> >> Hello,
> >>
> >> we added about 7 new disks yesterday/today and our cluster became very
> >> slow. While the rebalancing took place, 2 of the 7 new added disks
> >> died.
> >>
> >> Our cluster is still recovering, however we spotted that there are a lot
> >> of unfound objects.
> >>
> >> We lost osd.63 and osd.64, which seem not to be involved into the sample
> >> pg that has unfound objects.
> >>
> >> We were wondering why there are unfound objects, where they are coming
> >> from and if there is a way to recover them?
> >>
> >> Any help appreciated,
> >>
> >> Best,
> >>
> >> Nico
> >>
> >>
> >> Our status is:
> >>
> >>   cluster:
> >> id: 26c0c5a8-d7ce-49ac-b5a7-bfd9d0ba81ab
> >> health: HEALTH_WARN
> >> 261953/3006663 objects misplaced (8.712%)
> >> 9377/1002221 objects unfound (0.936%)
> >> Reduced data availability: 176 pgs inactive
> >> Degraded data redundancy: 609338/3006663 objects degraded (20.266%), 243 pgs unclean, 222 pgs degraded, 213 pgs undersized
> >> mon server2 is low on available space
> >>
> >>   services:
> >> mon: 3 daemons, quorum server5,server3,server2
> >> mgr: server5(active), standbys: 2, server2, 0, server3
> >> osd: 54 osds: 54 up, 54 in; 234 remapped pgs
> >>
> >>   data:
> >> pools:   3 pools, 1344 pgs
> >> objects: 978k objects, 3823 GB
> >> usage:   9350 GB used, 40298 GB / 49648 GB avail
> >> pgs: 13.095% pgs not active
> >>  609338/3006663 objects degraded (20.266%)
> >>  261953/3006663 objects misplaced (8.712%)
> >>  9377/1002221 objects unfound (0.936%)
> >>  1101 active+clean
> >>  84   recovery_wait+undersized+degraded+remapped+peered
> >>  82   undersized+degraded+remapped+backfill_wait+peered
> >>  23   active+undersized+degraded+remapped+backfill_wait
> >>  18   active+remapped+backfill_wait
> >>  14   active+undersized+degraded+remapped+backfilling
> >>  10   undersized+degraded+remapped+backfilling+peered
> >>  9active+recovery_wait+degraded
> >>  3active+remapped+backfilling
> >>
> >>   io:
> >> client:   624 kB/s rd, 3255 kB/s wr, 22 op/s rd, 66 op/s wr
> >> recovery: 90148 kB/s, 22 objects/s
> >>
> >> Looking at the unfound objects:
> >>
> >> [17:32:17] server1:~# ceph health detail
> >> HEALTH_WARN 263745/3006663 objects misplaced (8.772%); 9377/1002221
> >> objects unfound (0.936%); Reduced data availability: 176 pgs inactive;
> >> Degraded data redundancy: 612398/3006663 objects degraded (20.368%), 244
> >> pgs unclean, 223 pgs degraded, 214 pgs undersized; mon server2 is low on
> >> available space
> >> OBJECT_MISPLACED 263745/3006663 objects misplaced (8.772%)
> >> OBJECT_UNFOUND 9377/1002221 objects unfound (0.936%)
> >> pg 4.fa has 117 unfound objects
> >> pg 4.ff has 107 unfound objects
> >> pg 4.fd has 113 unfound objects
> >> pg 4.f0 has 120 unfound objects
> >> 
> >>
> >>
> >> Output from ceph pg 4.fa query:
> >>
> >> {
> >> "state": "recovery_wait+undersized+degraded+remapped+peered",
> >> "snap_trimq": "[]",
> >> "epoch": 17561,
> >> "up": [
> >> 8,
> >> 17,
> >> 25
> >> ],
> >> "acting": [
> >> 61
> >> ],
> >> "backfill_targets": [
> >> 

Re: [ceph-users] Adding disks -> getting unfound objects [Luminous]

2018-01-22 Thread Nico Schottelius

Hey David,

thanks for the fast answer. All our pools are running with size=3,
min_size=2 and the two disks were in 2 different hosts.

What I am a bit worried about is the output of "ceph pg 4.fa query" (see
below) that indicates that ceph already queried all other hosts and did
not find the data anywhere.

Do you remember having seen something similar?

Best,

Nico

David Turner  writes:

> I have had the same problem before with unfound objects that happened while
> backfilling after losing a drive. We didn't lose drives outside of the
> failure domains and ultimately didn't lose any data, but we did have to
> wait until after all of the PGs in recovery_wait state were caught up.  So
> if the 2 disks you lost were in the same host and your CRUSH rules are set
> so that you can lose a host without losing data, then the cluster will
> likely find all of the objects by the time it's done backfilling.  With
> only losing 2 disks, I wouldn't worry about the missing objects not
> becoming found unless your pool is size=2.
>
> On Mon, Jan 22, 2018 at 11:47 AM Nico Schottelius <
> nico.schottel...@ungleich.ch> wrote:
>
>>
>> Hello,
>>
>> we added about 7 new disks yesterday/today and our cluster became very
>> slow. While the rebalancing took place, 2 of the 7 new added disks
>> died.
>>
>> Our cluster is still recovering, however we spotted that there are a lot
>> of unfound objects.
>>
>> We lost osd.63 and osd.64, which seem not to be involved into the sample
>> pg that has unfound objects.
>>
>> We were wondering why there are unfound objects, where they are coming
>> from and if there is a way to recover them?
>>
>> Any help appreciated,
>>
>> Best,
>>
>> Nico
>>
>>
>> Our status is:
>>
>>   cluster:
>> id: 26c0c5a8-d7ce-49ac-b5a7-bfd9d0ba81ab
>> health: HEALTH_WARN
>> 261953/3006663 objects misplaced (8.712%)
>> 9377/1002221 objects unfound (0.936%)
>> Reduced data availability: 176 pgs inactive
>> Degraded data redundancy: 609338/3006663 objects degraded (20.266%), 243 pgs unclean, 222 pgs degraded, 213 pgs undersized
>> mon server2 is low on available space
>>
>>   services:
>> mon: 3 daemons, quorum server5,server3,server2
>> mgr: server5(active), standbys: 2, server2, 0, server3
>> osd: 54 osds: 54 up, 54 in; 234 remapped pgs
>>
>>   data:
>> pools:   3 pools, 1344 pgs
>> objects: 978k objects, 3823 GB
>> usage:   9350 GB used, 40298 GB / 49648 GB avail
>> pgs: 13.095% pgs not active
>>  609338/3006663 objects degraded (20.266%)
>>  261953/3006663 objects misplaced (8.712%)
>>  9377/1002221 objects unfound (0.936%)
>>  1101 active+clean
>>  84   recovery_wait+undersized+degraded+remapped+peered
>>  82   undersized+degraded+remapped+backfill_wait+peered
>>  23   active+undersized+degraded+remapped+backfill_wait
>>  18   active+remapped+backfill_wait
>>  14   active+undersized+degraded+remapped+backfilling
>>  10   undersized+degraded+remapped+backfilling+peered
>>  9active+recovery_wait+degraded
>>  3active+remapped+backfilling
>>
>>   io:
>> client:   624 kB/s rd, 3255 kB/s wr, 22 op/s rd, 66 op/s wr
>> recovery: 90148 kB/s, 22 objects/s
>>
>> Looking at the unfound objects:
>>
>> [17:32:17] server1:~# ceph health detail
>> HEALTH_WARN 263745/3006663 objects misplaced (8.772%); 9377/1002221
>> objects unfound (0.936%); Reduced data availability: 176 pgs inactive;
>> Degraded data redundancy: 612398/3006663 objects degraded (20.368%), 244
>> pgs unclean, 223 pgs degraded, 214 pgs undersized; mon server2 is low on
>> available space
>> OBJECT_MISPLACED 263745/3006663 objects misplaced (8.772%)
>> OBJECT_UNFOUND 9377/1002221 objects unfound (0.936%)
>> pg 4.fa has 117 unfound objects
>> pg 4.ff has 107 unfound objects
>> pg 4.fd has 113 unfound objects
>> pg 4.f0 has 120 unfound objects
>> 
>>
>>
>> Output from ceph pg 4.fa query:
>>
>> {
>> "state": "recovery_wait+undersized+degraded+remapped+peered",
>> "snap_trimq": "[]",
>> "epoch": 17561,
>> "up": [
>> 8,
>> 17,
>> 25
>> ],
>> "acting": [
>> 61
>> ],
>> "backfill_targets": [
>> "8",
>> "17",
>> "25"
>> ],
>> "actingbackfill": [
>> "8",
>> "17",
>> "25",
>> "61"
>> ],
>> "info": {
>> "pgid": "4.fa",
>> "last_update": "17529'85051",
>> "last_complete": "17217'77468",
>> "log_tail": "17091'75034",
>> "last_user_version": 85051,
>> "last_backfill": "MAX",
>> "last_backfill_bitwise": 0,
>> "purged_snaps": [
>> {
>> "start": "1",
>> "length": "3"
>> },
>> {
>> "start": "6",
>>   

Re: [ceph-users] Adding disks -> getting unfound objects [Luminous]

2018-01-22 Thread David Turner
I have had the same problem before with unfound objects that happened while
backfilling after losing a drive. We didn't lose drives outside of the
failure domains and ultimately didn't lose any data, but we did have to
wait until after all of the PGs in recovery_wait state were caught up.  So
if the 2 disks you lost were in the same host and your CRUSH rules are set
so that you can lose a host without losing data, then the cluster will
likely find all of the objects by the time it's done backfilling.  With
only losing 2 disks, I wouldn't worry about the missing objects not
becoming found unless your pool is size=2.
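
One way to watch that progress while waiting (a generic sketch, not specific to this cluster):

    # pgs still waiting on recovery, and the remaining unfound objects
    ceph pg dump pgs_brief 2>/dev/null | grep -c recovery_wait
    ceph health detail | grep unfound | head

    # overall recovery rate and object counts
    ceph -s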

On Mon, Jan 22, 2018 at 11:47 AM Nico Schottelius <
nico.schottel...@ungleich.ch> wrote:

>
> Hello,
>
> we added about 7 new disks yesterday/today and our cluster became very
> slow. While the rebalancing took place, 2 of the 7 new added disks
> died.
>
> Our cluster is still recovering, however we spotted that there are a lot
> of unfound objects.
>
> We lost osd.63 and osd.64, which seem not to be involved into the sample
> pg that has unfound objects.
>
> We were wondering why there are unfound objects, where they are coming
> from and if there is a way to recover them?
>
> Any help appreciated,
>
> Best,
>
> Nico
>
>
> Our status is:
>
>   cluster:
> id: 26c0c5a8-d7ce-49ac-b5a7-bfd9d0ba81ab
> health: HEALTH_WARN
> 261953/3006663 objects misplaced (8.712%)
> 9377/1002221 objects unfound (0.936%)
> Reduced data availability: 176 pgs inactive
> Degraded data redundancy: 609338/3006663 objects degraded (20.266%), 243 pgs unclean, 222 pgs degraded, 213 pgs undersized
> mon server2 is low on available space
>
>   services:
> mon: 3 daemons, quorum server5,server3,server2
> mgr: server5(active), standbys: 2, server2, 0, server3
> osd: 54 osds: 54 up, 54 in; 234 remapped pgs
>
>   data:
> pools:   3 pools, 1344 pgs
> objects: 978k objects, 3823 GB
> usage:   9350 GB used, 40298 GB / 49648 GB avail
> pgs: 13.095% pgs not active
>  609338/3006663 objects degraded (20.266%)
>  261953/3006663 objects misplaced (8.712%)
>  9377/1002221 objects unfound (0.936%)
>  1101 active+clean
>  84   recovery_wait+undersized+degraded+remapped+peered
>  82   undersized+degraded+remapped+backfill_wait+peered
>  23   active+undersized+degraded+remapped+backfill_wait
>  18   active+remapped+backfill_wait
>  14   active+undersized+degraded+remapped+backfilling
>  10   undersized+degraded+remapped+backfilling+peered
>  9active+recovery_wait+degraded
>  3active+remapped+backfilling
>
>   io:
> client:   624 kB/s rd, 3255 kB/s wr, 22 op/s rd, 66 op/s wr
> recovery: 90148 kB/s, 22 objects/s
>
> Looking at the unfound objects:
>
> [17:32:17] server1:~# ceph health detail
> HEALTH_WARN 263745/3006663 objects misplaced (8.772%); 9377/1002221
> objects unfound (0.936%); Reduced data availability: 176 pgs inactive;
> Degraded data redundancy: 612398/3006663 objects degraded (20.368%), 244
> pgs unclean, 223 pgs degraded, 214 pgs undersized; mon server2 is low on
> available space
> OBJECT_MISPLACED 263745/3006663 objects misplaced (8.772%)
> OBJECT_UNFOUND 9377/1002221 objects unfound (0.936%)
> pg 4.fa has 117 unfound objects
> pg 4.ff has 107 unfound objects
> pg 4.fd has 113 unfound objects
> pg 4.f0 has 120 unfound objects
> 
>
>
> Output from ceph pg 4.fa query:
>
> {
> "state": "recovery_wait+undersized+degraded+remapped+peered",
> "snap_trimq": "[]",
> "epoch": 17561,
> "up": [
> 8,
> 17,
> 25
> ],
> "acting": [
> 61
> ],
> "backfill_targets": [
> "8",
> "17",
> "25"
> ],
> "actingbackfill": [
> "8",
> "17",
> "25",
> "61"
> ],
> "info": {
> "pgid": "4.fa",
> "last_update": "17529'85051",
> "last_complete": "17217'77468",
> "log_tail": "17091'75034",
> "last_user_version": 85051,
> "last_backfill": "MAX",
> "last_backfill_bitwise": 0,
> "purged_snaps": [
> {
> "start": "1",
> "length": "3"
> },
> {
> "start": "6",
> "length": "8"
> },
> {
> "start": "10",
> "length": "2"
> }
> ],
> "history": {
> "epoch_created": 9134,
> "epoch_pool_created": 9134,
> "last_epoch_started": 17528,
> "last_interval_started": 17527,
> "last_epoch_clean": 17079,
> "last_interval_clean": 17078,
> "last_epoch_split": 0,
> "last_epoch_marked_full": 0,
> "same_up_since": 17143,

[ceph-users] Adding disks -> getting unfound objects [Luminous]

2018-01-22 Thread Nico Schottelius

Hello,

we added about 7 new disks yesterday/today and our cluster became very
slow. While the rebalancing took place, 2 of the 7 newly added disks
died.

Our cluster is still recovering; however, we spotted that there are a lot
of unfound objects.

We lost osd.63 and osd.64, which do not seem to be involved in the sample
pg that has unfound objects.

We were wondering why there are unfound objects, where they are coming
from, and whether there is a way to recover them.

Any help appreciated,

Best,

Nico


Our status is:

  cluster:
id: 26c0c5a8-d7ce-49ac-b5a7-bfd9d0ba81ab
health: HEALTH_WARN
261953/3006663 objects misplaced (8.712%)
9377/1002221 objects unfound (0.936%)
Reduced data availability: 176 pgs inactive
                Degraded data redundancy: 609338/3006663 objects degraded (20.266%), 243 pgs unclean, 222 pgs degraded, 213 pgs undersized
mon server2 is low on available space

  services:
mon: 3 daemons, quorum server5,server3,server2
mgr: server5(active), standbys: 2, server2, 0, server3
osd: 54 osds: 54 up, 54 in; 234 remapped pgs

  data:
pools:   3 pools, 1344 pgs
objects: 978k objects, 3823 GB
usage:   9350 GB used, 40298 GB / 49648 GB avail
pgs: 13.095% pgs not active
 609338/3006663 objects degraded (20.266%)
 261953/3006663 objects misplaced (8.712%)
 9377/1002221 objects unfound (0.936%)
 1101 active+clean
 84   recovery_wait+undersized+degraded+remapped+peered
 82   undersized+degraded+remapped+backfill_wait+peered
 23   active+undersized+degraded+remapped+backfill_wait
 18   active+remapped+backfill_wait
 14   active+undersized+degraded+remapped+backfilling
 10   undersized+degraded+remapped+backfilling+peered
 9active+recovery_wait+degraded
 3active+remapped+backfilling

  io:
client:   624 kB/s rd, 3255 kB/s wr, 22 op/s rd, 66 op/s wr
recovery: 90148 kB/s, 22 objects/s

Looking at the unfound objects:

[17:32:17] server1:~# ceph health detail
HEALTH_WARN 263745/3006663 objects misplaced (8.772%); 9377/1002221 objects unfound (0.936%); Reduced data availability: 176 pgs inactive; Degraded data redundancy: 612398/3006663 objects degraded (20.368%), 244 pgs unclean, 223 pgs degraded, 214 pgs undersized; mon server2 is low on available space
OBJECT_MISPLACED 263745/3006663 objects misplaced (8.772%)
OBJECT_UNFOUND 9377/1002221 objects unfound (0.936%)
pg 4.fa has 117 unfound objects
pg 4.ff has 107 unfound objects
pg 4.fd has 113 unfound objects
pg 4.f0 has 120 unfound objects



Output from ceph pg 4.fa query:

{
"state": "recovery_wait+undersized+degraded+remapped+peered",
"snap_trimq": "[]",
"epoch": 17561,
"up": [
8,
17,
25
],
"acting": [
61
],
"backfill_targets": [
"8",
"17",
"25"
],
"actingbackfill": [
"8",
"17",
"25",
"61"
],
"info": {
"pgid": "4.fa",
"last_update": "17529'85051",
"last_complete": "17217'77468",
"log_tail": "17091'75034",
"last_user_version": 85051,
"last_backfill": "MAX",
"last_backfill_bitwise": 0,
"purged_snaps": [
{
"start": "1",
"length": "3"
},
{
"start": "6",
"length": "8"
},
{
"start": "10",
"length": "2"
}
],
"history": {
"epoch_created": 9134,
"epoch_pool_created": 9134,
"last_epoch_started": 17528,
"last_interval_started": 17527,
"last_epoch_clean": 17079,
"last_interval_clean": 17078,
"last_epoch_split": 0,
"last_epoch_marked_full": 0,
"same_up_since": 17143,
"same_interval_since": 17530,
"same_primary_since": 17515,
"last_scrub": "17090'57357",
"last_scrub_stamp": "2018-01-20 20:45:32.616142",
"last_deep_scrub": "17082'54734",
"last_deep_scrub_stamp": "2018-01-15 21:09:34.121488",
"last_clean_scrub_stamp": "2018-01-20 20:45:32.616142"
},
"stats": {
"version": "17529'85051",
"reported_seq": "218453",
"reported_epoch": "17561",
"state": "recovery_wait+undersized+degraded+remapped+peered",
"last_fresh": "2018-01-22 17:42:28.196701",
"last_change": "2018-01-22 15:00:46.507189",
"last_active": "2018-01-22 15:00:44.635399",
"last_peered": "2018-01-22 17:42:28.196701",
"last_clean": "2018-01-21 20:15:48.267209",
"last_became_active": "2018-01-22 14:53:07.918893",
"last_became_peered