Re: [ceph-users] fixing another remapped+incomplete EC 4+2 pg

2018-10-19 Thread Gregory Farnum
On Thu, Oct 18, 2018 at 2:28 PM Graham Allan  wrote:

> Thanks Greg,
>
> This did get resolved though I'm not 100% certain why!
>
> For one of the suspect shards which caused crash on backfill, I
> attempted to delete the associated via s3, late last week. I then
> examined the filestore OSDs and the file shards were still present...
> maybe for an hour following (after which I stopped looking).
>
> I left the cluster set to nobackfill over the weekend, during which time
> all osds kept running; then on Monday morning re-enabled backfill. I
> expected the osd to crash again. after which I could look into moving or
> deleting the implicated backfill shards out of the way. Instead of which
> it happily backfilled its way to cleanliness.
>
> I suppose it's possible the shards got deleted later in some kind of rgw
> gc operation, and this could have cleared the problem? Unfortunately I
> didn't look for them again before re-enabling backfill. I'm not sure if
> that's how s3 object deletion works - does it make any sense?
>

Yes, RGW generates garbage collection logs that are followed along and
perform actual object deletes separately from when they're marked deleted
in the S3 protocol. I don't know the details of the process but it's
entirely plausible that it just went through and deleted all the bad
objects during that time period.
-Greg


>
> The only other thing I did late last week was notice that one of the
> active osds for the pg seemed very slow to respond - the drive was
> clearly failing. I was never getting any actual i/o errors at the user
> or osd level, though it did trigger a 24-hour deathwatch SMART warning a
> bit later.
>
> I exported the pg shard from the failing osd, and re-imported it to
> another otherwise-evacuated osd. This was just for data safety; it seems
> really unlikely this could be causing the other osds in the pg to crash...
>
> Graham
>
> On 10/15/2018 01:44 PM, Gregory Farnum wrote:
> >
> >
> > On Thu, Oct 11, 2018 at 3:22 PM Graham Allan  >
> > As the osd crash implies, setting "nobackfill" appears to let all the
> > osds keep running and the pg stays active and can apparently serve
> data.
> >
> > If I track down the object referenced below in the object store, I
> can
> > download it without error via s3... though as I can't generate a
> > matching etag, it may well be corrupt.
> >
> > Still I do wonder if deleting this object - either via s3, or maybe
> > more
> > likely directly within filestore, might permit backfill to continue.
> >
> >
> > Yes, that is very likely! (...unless there are a bunch of other objects
> > with the same issue.)
> >
> > I'm not immediately familiar with the crash asserts you're seeing, but
> > it certainly looks like somehow the object data didn't quite get stored
> > correctly as the metadata understands it. Perhaps a write got
> > lost/missed on m+1 of the PG shards, setting the
> > osd_find_best_info_ignore_history_les caused it to try and recover from
> > what it had rather than following normal recovery procedures, and now
> > it's not working.
> > -Greg
>
>
> --
> Graham Allan
> Minnesota Supercomputing Institute - g...@umn.edu
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] fixing another remapped+incomplete EC 4+2 pg

2018-10-18 Thread Graham Allan

Thanks Greg,

This did get resolved though I'm not 100% certain why!

For one of the suspect shards which caused crash on backfill, I 
attempted to delete the associated via s3, late last week. I then 
examined the filestore OSDs and the file shards were still present... 
maybe for an hour following (after which I stopped looking).


I left the cluster set to nobackfill over the weekend, during which time 
all osds kept running; then on Monday morning re-enabled backfill. I 
expected the osd to crash again. after which I could look into moving or 
deleting the implicated backfill shards out of the way. Instead of which 
it happily backfilled its way to cleanliness.


I suppose it's possible the shards got deleted later in some kind of rgw 
gc operation, and this could have cleared the problem? Unfortunately I 
didn't look for them again before re-enabling backfill. I'm not sure if 
that's how s3 object deletion works - does it make any sense?


The only other thing I did late last week was notice that one of the 
active osds for the pg seemed very slow to respond - the drive was 
clearly failing. I was never getting any actual i/o errors at the user 
or osd level, though it did trigger a 24-hour deathwatch SMART warning a 
bit later.


I exported the pg shard from the failing osd, and re-imported it to 
another otherwise-evacuated osd. This was just for data safety; it seems 
really unlikely this could be causing the other osds in the pg to crash...


Graham

On 10/15/2018 01:44 PM, Gregory Farnum wrote:



On Thu, Oct 11, 2018 at 3:22 PM Graham Allan 

As the osd crash implies, setting "nobackfill" appears to let all the
osds keep running and the pg stays active and can apparently serve data.

If I track down the object referenced below in the object store, I can
download it without error via s3... though as I can't generate a
matching etag, it may well be corrupt.

Still I do wonder if deleting this object - either via s3, or maybe
more
likely directly within filestore, might permit backfill to continue.


Yes, that is very likely! (...unless there are a bunch of other objects 
with the same issue.)


I'm not immediately familiar with the crash asserts you're seeing, but 
it certainly looks like somehow the object data didn't quite get stored 
correctly as the metadata understands it. Perhaps a write got 
lost/missed on m+1 of the PG shards, setting the 
osd_find_best_info_ignore_history_les caused it to try and recover from 
what it had rather than following normal recovery procedures, and now 
it's not working.

-Greg



--
Graham Allan
Minnesota Supercomputing Institute - g...@umn.edu
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] fixing another remapped+incomplete EC 4+2 pg

2018-10-15 Thread Gregory Farnum
On Thu, Oct 11, 2018 at 3:22 PM Graham Allan  wrote:

> As the osd crash implies, setting "nobackfill" appears to let all the
> osds keep running and the pg stays active and can apparently serve data.
>
> If I track down the object referenced below in the object store, I can
> download it without error via s3... though as I can't generate a
> matching etag, it may well be corrupt.
>
> Still I do wonder if deleting this object - either via s3, or maybe more
> likely directly within filestore, might permit backfill to continue.
>

Yes, that is very likely! (...unless there are a bunch of other objects
with the same issue.)

I'm not immediately familiar with the crash asserts you're seeing, but it
certainly looks like somehow the object data didn't quite get stored
correctly as the metadata understands it. Perhaps a write got lost/missed
on m+1 of the PG shards, setting the osd_find_best_info_ignore_history_les
caused it to try and recover from what it had rather than following normal
recovery procedures, and now it's not working.
-Greg


>
> Of other objects referenced in "last_backfill" for each osd in the pg,
> both also download via s3, but one I have seen implicated in similar
> crash logs for otehr OSDs and the etag again does not match; the other I
> have not seen in crash logs and does generate a matching etag.
>
> Opened a tracker issue for this: http://tracker.ceph.com/issues/36411
>
> Graham
>
> On 10/09/2018 06:55 PM, Graham Allan wrote:
> >
> > On 10/09/2018 01:14 PM, Graham Allan wrote:
> >>
> >> On 10/9/2018 12:19 PM, Gregory Farnum wrote:
> >>>
> >>> I think unfortunately the easiest thing for you to fix this will be
> >>> to set the min_size back to 4 until the PG is recovered (or at least
> >>> has 5 shards done). This will be fixed in a later version of Ceph and
> >>> probably backported, but sadly it's not done yet.
> >>> -Greg
> >
> >> Thanks Greg, though sadly I've tried that; whatever I do, one of the 4
> >> osds involved will simply crash (not just the ones I previously tried
> >> to re-import via ceph-objectstore-tool). I just spend time chasing
> >> them around but never succeeding in having a complete set run long
> >> enough to make progress. They seem to crash when starting backfill on
> >> the next object. There has to be something in the current set of
> >> shards which it can't handle.
> >>
> >> Since then I've been focusing on trying to get the pg to revert to an
> >> earlier interval using osd_find_best_info_ignore_history_les, though
> >> the information I find around it is minimal.
> >
> > Since my experiments with osd_find_best_info_ignore_history_les have not
> > borne any results, I'm looking again at the osd crashes when I get
> > enough of them running for backfill to start.
> >
> > They all crash in the same way; with "debug osd=10", the very last bit
> is:
> >
> >> -2> 2018-10-09 16:27:25.425004 7faa866bd700 10 osd.190 pg_epoch:
> >> 710808 pg[70.82ds2( v 710799'704745 (586066'698574,710799'704745]
> >> local-lis/les=710807/710808 n=102929 ec=21494/21494 lis/c
> >> 710807/588565 les/c/f 710808/588566/0 710711/710807/710807)
> >> [820,761,105,789,562,485]/[2147483647 <(214)%20748-3647>,2147483647
> <(214)%20748-3647>,190,448,61,315]p190(2) r=2
> >> lpr=710807 pi=[588565,710807)/39 rops=1
> >> bft=105(2),485(5),562(4),761(1),789(3),820(0) crt=710799'70
> >> 4745 lcod 0'0 mlcod 0'0
> >> active+undersized+degraded+remapped+backfilling] continue_recovery_op:
> >> continuing
> >>
> RecoveryOp(hoid=70:b415ca12:::default.10341087.1__shadow_Subjects%2fsub-NDARINV6BVVAY29%2fses-baselineYear1Arm1%2ffunc%2frun-04_AROMA%2fdenoised_func_data_nonaggr.nii.2~VfI6rAYnU4XtzUcFAoecJDU5TpVR8AP.16_3:head
>
> >> v=579167'695462 missing_on=105(2),485(5),562(4),761(1),789(3),820(0)
> >> missing_on_shards=0,1,2,3,4,5 recovery_inf
> >>
> o=ObjectRecoveryInfo(70:b415ca12:::default.10341087.1__shadow_Subjects%2fsub-NDARINV6BVVAY29%2fses-baselineYear1Arm1%2ffunc%2frun-04_AROMA%2fdenoised_func_data_nonaggr.nii.2~VfI6rAYnU4XtzUcFAoecJDU5TpVR8AP.16_3:head@579167'695462,
>
> >> size: 0, copy_subset: [], clone_subset: {}, snapset: 0=[]:[])
> >> recovery_progress=ObjectRecoveryProgress(first, data_recovered_to:0,
> >> data_complete:false, omap_recovered_to:, omap_complete:true, error
> >> :false) obc refcount=3 state=READING waiting_on_pushes=
> >> extent_requested=0,8388608)
> >> -1> 2018-10-09 16:27:25.425105 7faa866bd700 10 osd.190 pg_epoch:
> >> 710808 pg[70.82ds2( v 710799'704745 (586066'698574,710799'704745]
> >> local-lis/les=710807/710808 n=102929 ec=21494/21494 lis/c
> >> 710807/588565 les/c/f 710808/588566/0 710711/710807/710807)
> >> [820,761,105,789,562,485]/[2147483647 <(214)%20748-3647>,2147483647
> <(214)%20748-3647>,190,448,61,315]p190(2) r=2
> >> lpr=710807 pi=[588565,710807)/39 rops=1
> >> bft=105(2),485(5),562(4),761(1),789(3),820(0) crt=710799'70
> >> 4745 lcod 0'0 mlcod 0'0
> >> active+undersized+degraded+remapped+backfilling] continue_recovery_op:
> >> 

Re: [ceph-users] fixing another remapped+incomplete EC 4+2 pg

2018-10-11 Thread Graham Allan
As the osd crash implies, setting "nobackfill" appears to let all the 
osds keep running and the pg stays active and can apparently serve data.


If I track down the object referenced below in the object store, I can 
download it without error via s3... though as I can't generate a 
matching etag, it may well be corrupt.


Still I do wonder if deleting this object - either via s3, or maybe more 
likely directly within filestore, might permit backfill to continue.


Of other objects referenced in "last_backfill" for each osd in the pg, 
both also download via s3, but one I have seen implicated in similar 
crash logs for otehr OSDs and the etag again does not match; the other I 
have not seen in crash logs and does generate a matching etag.


Opened a tracker issue for this: http://tracker.ceph.com/issues/36411

Graham

On 10/09/2018 06:55 PM, Graham Allan wrote:


On 10/09/2018 01:14 PM, Graham Allan wrote:


On 10/9/2018 12:19 PM, Gregory Farnum wrote:


I think unfortunately the easiest thing for you to fix this will be 
to set the min_size back to 4 until the PG is recovered (or at least 
has 5 shards done). This will be fixed in a later version of Ceph and 
probably backported, but sadly it's not done yet.

-Greg


Thanks Greg, though sadly I've tried that; whatever I do, one of the 4 
osds involved will simply crash (not just the ones I previously tried 
to re-import via ceph-objectstore-tool). I just spend time chasing 
them around but never succeeding in having a complete set run long 
enough to make progress. They seem to crash when starting backfill on 
the next object. There has to be something in the current set of 
shards which it can't handle.


Since then I've been focusing on trying to get the pg to revert to an 
earlier interval using osd_find_best_info_ignore_history_les, though 
the information I find around it is minimal.


Since my experiments with osd_find_best_info_ignore_history_les have not 
borne any results, I'm looking again at the osd crashes when I get 
enough of them running for backfill to start.


They all crash in the same way; with "debug osd=10", the very last bit is:

    -2> 2018-10-09 16:27:25.425004 7faa866bd700 10 osd.190 pg_epoch: 
710808 pg[70.82ds2( v 710799'704745 (586066'698574,710799'704745] 
local-lis/les=710807/710808 n=102929 ec=21494/21494 lis/c 
710807/588565 les/c/f 710808/588566/0 710711/710807/710807) 
[820,761,105,789,562,485]/[2147483647,2147483647,190,448,61,315]p190(2) r=2 
lpr=710807 pi=[588565,710807)/39 rops=1 
bft=105(2),485(5),562(4),761(1),789(3),820(0) crt=710799'70
4745 lcod 0'0 mlcod 0'0 
active+undersized+degraded+remapped+backfilling] continue_recovery_op: 
continuing 
RecoveryOp(hoid=70:b415ca12:::default.10341087.1__shadow_Subjects%2fsub-NDARINV6BVVAY29%2fses-baselineYear1Arm1%2ffunc%2frun-04_AROMA%2fdenoised_func_data_nonaggr.nii.2~VfI6rAYnU4XtzUcFAoecJDU5TpVR8AP.16_3:head 
v=579167'695462 missing_on=105(2),485(5),562(4),761(1),789(3),820(0) 
missing_on_shards=0,1,2,3,4,5 recovery_inf
o=ObjectRecoveryInfo(70:b415ca12:::default.10341087.1__shadow_Subjects%2fsub-NDARINV6BVVAY29%2fses-baselineYear1Arm1%2ffunc%2frun-04_AROMA%2fdenoised_func_data_nonaggr.nii.2~VfI6rAYnU4XtzUcFAoecJDU5TpVR8AP.16_3:head@579167'695462, 
size: 0, copy_subset: [], clone_subset: {}, snapset: 0=[]:[]) 
recovery_progress=ObjectRecoveryProgress(first, data_recovered_to:0, 
data_complete:false, omap_recovered_to:, omap_complete:true, error
:false) obc refcount=3 state=READING waiting_on_pushes= 
extent_requested=0,8388608)
    -1> 2018-10-09 16:27:25.425105 7faa866bd700 10 osd.190 pg_epoch: 
710808 pg[70.82ds2( v 710799'704745 (586066'698574,710799'704745] 
local-lis/les=710807/710808 n=102929 ec=21494/21494 lis/c 
710807/588565 les/c/f 710808/588566/0 710711/710807/710807) 
[820,761,105,789,562,485]/[2147483647,2147483647,190,448,61,315]p190(2) r=2 
lpr=710807 pi=[588565,710807)/39 rops=1 
bft=105(2),485(5),562(4),761(1),789(3),820(0) crt=710799'70
4745 lcod 0'0 mlcod 0'0 
active+undersized+degraded+remapped+backfilling] continue_recovery_op: 
before_progress=ObjectRecoveryProgress(first, data_recovered_to:0, 
data_complete:false, omap_recovered_to:, omap_complete:true, 
error:false), after_progress=ObjectRecoveryProgress(!first, 
data_recovered_to:0, data_complete:true, omap_recovered_to:, 
omap_complete:true, error:false), pop.data.length()=786432, size=0
 0> 2018-10-09 16:27:25.430239 7faa866bd700 -1 
/build/ceph-12.2.8/src/osd/ECBackend.cc: In function 'void 
ECBackend::continue_recovery_op(ECBackend::RecoveryOp&, 
RecoveryMessages*)' thread 7faa866bd700 time 2018-10-09 16:27:25.425135
/build/ceph-12.2.8/src/osd/ECBackend.cc: 619: FAILED 
assert(pop.data.length() == 
sinfo.aligned_logical_offset_to_chunk_offset( 
after_progress.data_recovered_to - 
op.recovery_progress.data_recovered_to))


It sounds like we're getting an unexpected data chunk size, presumably 
due to some kind of data corruption.


Though if I search on all acting OSDs (up or down) for the 

Re: [ceph-users] fixing another remapped+incomplete EC 4+2 pg

2018-10-09 Thread Graham Allan




On 10/09/2018 01:14 PM, Graham Allan wrote:



On 10/9/2018 12:19 PM, Gregory Farnum wrote:


I think unfortunately the easiest thing for you to fix this will be to 
set the min_size back to 4 until the PG is recovered (or at least has 
5 shards done). This will be fixed in a later version of Ceph and 
probably backported, but sadly it's not done yet.

-Greg


Thanks Greg, though sadly I've tried that; whatever I do, one of the 4 
osds involved will simply crash (not just the ones I previously tried to 
re-import via ceph-objectstore-tool). I just spend time chasing them 
around but never succeeding in having a complete set run long enough to 
make progress. They seem to crash when starting backfill on the next 
object. There has to be something in the current set of shards which it 
can't handle.


Since then I've been focusing on trying to get the pg to revert to an 
earlier interval using osd_find_best_info_ignore_history_les, though the 
information I find around it is minimal.


Since my experiments with osd_find_best_info_ignore_history_les have not 
borne any results, I'm looking again at the osd crashes when I get 
enough of them running for backfill to start.


They all crash in the same way; with "debug osd=10", the very last bit is:


-2> 2018-10-09 16:27:25.425004 7faa866bd700 10 osd.190 pg_epoch: 710808 
pg[70.82ds2( v 710799'704745 (586066'698574,710799'704745] 
local-lis/les=710807/710808 n=102929 ec=21494/21494 lis/c 710807/588565 les/c/f 
710808/588566/0 710711/710807/710807) 
[820,761,105,789,562,485]/[2147483647,2147483647,190,448,61,315]p190(2) r=2 
lpr=710807 pi=[588565,710807)/39 rops=1 
bft=105(2),485(5),562(4),761(1),789(3),820(0) crt=710799'70
4745 lcod 0'0 mlcod 0'0 active+undersized+degraded+remapped+backfilling] 
continue_recovery_op: continuing 
RecoveryOp(hoid=70:b415ca12:::default.10341087.1__shadow_Subjects%2fsub-NDARINV6BVVAY29%2fses-baselineYear1Arm1%2ffunc%2frun-04_AROMA%2fdenoised_func_data_nonaggr.nii.2~VfI6rAYnU4XtzUcFAoecJDU5TpVR8AP.16_3:head
 v=579167'695462 missing_on=105(2),485(5),562(4),761(1),789(3),820(0) 
missing_on_shards=0,1,2,3,4,5 recovery_inf
o=ObjectRecoveryInfo(70:b415ca12:::default.10341087.1__shadow_Subjects%2fsub-NDARINV6BVVAY29%2fses-baselineYear1Arm1%2ffunc%2frun-04_AROMA%2fdenoised_func_data_nonaggr.nii.2~VfI6rAYnU4XtzUcFAoecJDU5TpVR8AP.16_3:head@579167'695462,
 size: 0, copy_subset: [], clone_subset: {}, snapset: 0=[]:[]) 
recovery_progress=ObjectRecoveryProgress(first, data_recovered_to:0, 
data_complete:false, omap_recovered_to:, omap_complete:true, error
:false) obc refcount=3 state=READING waiting_on_pushes= 
extent_requested=0,8388608)
-1> 2018-10-09 16:27:25.425105 7faa866bd700 10 osd.190 pg_epoch: 710808 
pg[70.82ds2( v 710799'704745 (586066'698574,710799'704745] 
local-lis/les=710807/710808 n=102929 ec=21494/21494 lis/c 710807/588565 les/c/f 
710808/588566/0 710711/710807/710807) 
[820,761,105,789,562,485]/[2147483647,2147483647,190,448,61,315]p190(2) r=2 
lpr=710807 pi=[588565,710807)/39 rops=1 
bft=105(2),485(5),562(4),761(1),789(3),820(0) crt=710799'70
4745 lcod 0'0 mlcod 0'0 active+undersized+degraded+remapped+backfilling] 
continue_recovery_op: before_progress=ObjectRecoveryProgress(first, 
data_recovered_to:0, data_complete:false, omap_recovered_to:, 
omap_complete:true, error:false), after_progress=ObjectRecoveryProgress(!first, 
data_recovered_to:0, data_complete:true, omap_recovered_to:, 
omap_complete:true, error:false), pop.data.length()=786432, size=0
 0> 2018-10-09 16:27:25.430239 7faa866bd700 -1 
/build/ceph-12.2.8/src/osd/ECBackend.cc: In function 'void 
ECBackend::continue_recovery_op(ECBackend::RecoveryOp&, RecoveryMessages*)' thread 
7faa866bd700 time 2018-10-09 16:27:25.425135
/build/ceph-12.2.8/src/osd/ECBackend.cc: 619: FAILED assert(pop.data.length() 
== sinfo.aligned_logical_offset_to_chunk_offset( 
after_progress.data_recovered_to - op.recovery_progress.data_recovered_to))


It sounds like we're getting an unexpected data chunk size, presumably 
due to some kind of data corruption.


Though if I search on all acting OSDs (up or down) for the object on 
which backfill crashed, I do find all shards and they do at least have a 
consistent size... perhaps that size itself is different from expected.


osd 233:
root@ceph01:/var/lib/ceph/osd/ceph-233/current# find 70.82ds1_head -name '*NDARINV6BVVAY29*' -type f -ls
23624833062  772 -rw-r--r--   1 root root   786432 Sep 26 15:32 

Re: [ceph-users] fixing another remapped+incomplete EC 4+2 pg

2018-10-09 Thread Graham Allan



On 10/9/2018 12:19 PM, Gregory Farnum wrote:
On Wed, Oct 3, 2018 at 10:18 AM Graham Allan > wrote:


However I have one pg which is stuck in state remapped+incomplete
because it has only 4 out of 6 osds running, and I have been unable to
bring the missing two back into service.

 > PG_AVAILABILITY Reduced data availability: 1 pg inactive, 1 pg
incomplete
 >     pg 70.82d is remapped+incomplete, acting [2147483647
,2147483647
,190,448,61,315] (reducing pool
.rgw.buckets.ec42 min_size from 5 may help; search ceph.com/docs
 for 'incomplete')

I don't think I want to do anything with min_size as that would make
all
other pgs vulnerable to running dangerously undersized (unless there is
any way to force that state for only a single pg). It seems to me that
with 4/6 osds available, it should maybe be possible to force ceph to
select one or two new osds to rebalance this pg to?


I think unfortunately the easiest thing for you to fix this will be to 
set the min_size back to 4 until the PG is recovered (or at least has 5 
shards done). This will be fixed in a later version of Ceph and probably 
backported, but sadly it's not done yet.

-Greg
Thanks Greg, though sadly I've tried that; whatever I do, one of the 4 
osds involved will simply crash (not just the ones I previously tried to 
re-import via ceph-objectstore-tool). I just spend time chasing them 
around but never succeeding in having a complete set run long enough to 
make progress. They seem to crash when starting backfill on the next 
object. There has to be something in the current set of shards which it 
can't handle.


Since then I've been focusing on trying to get the pg to revert to an 
earlier interval using osd_find_best_info_ignore_history_les, though the 
information I find around it is minimal.


Most sources seem to suggest setting it for the primary osd then either 
setting it down or restarting it, but that simply seems to result in the 
osd disappearing from the pg. After setting this flag for all of the 
"acting" osds (most recent interval), the pg switched to having the set 
of "active" osds == "up" osds, but still "incomplete" (it's not reverted 
to the set of osds in an earlier interval). Still stuck with condition 
"peering_blocked_by_history_les_bound" at present.


I'm guessing that I actually need to set  the flag 
osd_find_best_info_ignore_history_les for *all* osds involved in the 
historical record of this pg (the "probing osds" list?), and restart 
them all...


Still also trying to understand exactly how the flag works. I think I 
see now that the "_les" bit must refer to "last epoch started"...


--
Graham Allan
Minnesota Supercomputing Institute - g...@umn.edu
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] fixing another remapped+incomplete EC 4+2 pg

2018-10-09 Thread Gregory Farnum
On Wed, Oct 3, 2018 at 10:18 AM Graham Allan  wrote:

> Following on from my previous adventure with recovering pgs in the face
> of failed OSDs, I now have my EC 4+2 pool oeprating with min_size=5
> which is as things should be.
>
> However I have one pg which is stuck in state remapped+incomplete
> because it has only 4 out of 6 osds running, and I have been unable to
> bring the missing two back into service.
>
> > PG_AVAILABILITY Reduced data availability: 1 pg inactive, 1 pg incomplete
> > pg 70.82d is remapped+incomplete, acting [2147483647
> <(214)%20748-3647>,2147483647 <(214)%20748-3647>,190,448,61,315]
> (reducing pool .rgw.buckets.ec42 min_size from 5 may help; search
> ceph.com/docs for 'incomplete')
>
> I don't think I want to do anything with min_size as that would make all
> other pgs vulnerable to running dangerously undersized (unless there is
> any way to force that state for only a single pg). It seems to me that
> with 4/6 osds available, it should maybe be possible to force ceph to
> select one or two new osds to rebalance this pg to?
>

I think unfortunately the easiest thing for you to fix this will be to set
the min_size back to 4 until the PG is recovered (or at least has 5 shards
done). This will be fixed in a later version of Ceph and probably
backported, but sadly it's not done yet.
-Greg


>
> ceph pg query gives me (snippet):
>
> > "down_osds_we_would_probe": [
> > 98,
> > 233,
> > 238,
> > 239
> > ],
> > "peering_blocked_by": [],
> > "peering_blocked_by_detail": [
> > {
> > "detail": "peering_blocked_by_history_les_bound"
> > }
> > ]
>
> Of these, osd 98 appears to have a corrupt xfs filesystem
>
> osd 239 was the original osd to hold a shard of this pg but would not
> keep running, exiting with:
>
> > /build/ceph-12.2.7/src/osd/ECBackend.cc: 619: FAILED
> assert(pop.data.length() == sinfo.aligned_logical_offset_to_chunk_offset(
> after_progress.data_recovered_to - op.recovery_progress.data_recovered_to))
>
> osds 233 and 238 were otherwise evacuated (weight 0) osds to which I
> imported the pg shard from osd 239 (using ceph-objectstore-tool). After
> which they crash with the same assert. More specifically they seem to
> crash in the same way each time the pg becomes active and starts to
> backfill, on the same object:
>
> > -9> 2018-10-03 11:30:28.174586 7f94ce9c4700  5 osd.233 pg_epoch:
> 704441 pg[70.82ds1( v 704329'703106 (586066'698574,704329'703106]
> local-lis/les=704439/704440 n=102585 ec=21494/21494 lis/c 704439/588565
> les/c/f 704440/588566/0 68066
> > 6/704439/704439) [820,761,105,789,562,485]/[2147483647
> <(214)%20748-3647>,233,190,448,61,315]p233(1) r=1 lpr=704439
> pi=[21494,704439)/4 rops=1 bft=105(2),485(5),562(4),761(1),789(3),820(0)
> crt=704329'703106 lcod 0'0 mlcod 0'0 active+undersized+remapped+ba
> > ckfilling] backfill_pos is
> 70:b415ca14:::default.630943.7__shadow_Barley_GC_Project%2fBarley_GC_Project%2fRawdata%2fReads%2fCZOA.6150.7.38741.TGCTGG.fastq.gz.2~Vn8g0rMwpVY8eaW83TDzJ2mczLXAl3z.3_24:head
> > -8> 2018-10-03 11:30:28.174887 7f94ce9c4700  1 --
> 10.31.0.1:6854/2210291 --> 10.31.0.1:6854/2210291 --
> MOSDECSubOpReadReply(70.82ds1 704441/704439 ECSubReadReply(tid=1,
> attrs_read=0)) v2 -- 0x7f9500472280 con 0
> > -7> 2018-10-03 11:30:28.174902 7f94db9de700  1 --
> 10.31.0.1:6854/2210291 <== osd.233 10.31.0.1:6854/2210291 0 
> MOSDECSubOpReadReply(70.82ds1 704441/704439 ECSubReadReply(tid=1,
> attrs_read=0)) v2  0+0+0 (0 0 0) 0x7f9500472280
> >  con 0x7f94fb72b000
> > -6> 2018-10-03 11:30:28.176267 7f94ead66700  5 --
> 10.31.0.1:6854/2210291 >> 10.31.0.4:6880/2181727 conn(0x7f94ff2a6000 :-1
> s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=946 cs=1 l=0). rx osd.61
> seq 9 0x7f9500472500 MOSDECSubOpRe
> > adReply(70.82ds1 704441/704439 ECSubReadReply(tid=1, attrs_read=0)) v2
> > -5> 2018-10-03 11:30:28.176281 7f94ead66700  1 --
> 10.31.0.1:6854/2210291 <== osd.61 10.31.0.4:6880/2181727 9 
> MOSDECSubOpReadReply(70.82ds1 704441/704439 ECSubReadReply(tid=1,
> attrs_read=0)) v2  786745+0+0 (875698380 0 0) 0x
> > 7f9500472500 con 0x7f94ff2a6000
> > -4> 2018-10-03 11:30:28.177723 7f94ead66700  5 --
> 10.31.0.1:6854/2210291 >> 10.31.0.9:6920/13427 conn(0x7f94ff2bc800 :-1
> s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=46152 cs=1 l=0). rx
> osd.448 seq 8 0x7f94fe9d5980 MOSDECSubOpR
> > eadReply(70.82ds1 704441/704439 ECSubReadReply(tid=1, attrs_read=0)) v2
> > -3> 2018-10-03 11:30:28.177738 7f94ead66700  1 --
> 10.31.0.1:6854/2210291 <== osd.448 10.31.0.9:6920/13427 8 
> MOSDECSubOpReadReply(70.82ds1 704441/704439 ECSubReadReply(tid=1,
> attrs_read=0)) v2  786745+0+0 (2772477454 0 0) 0x
> > 7f94fe9d5980 con 0x7f94ff2bc800
> > -2> 2018-10-03 11:30:28.185788 7f94ea565700  5 --
> 10.31.0.1:6854/2210291 >> 

Re: [ceph-users] fixing another remapped+incomplete EC 4+2 pg

2018-10-09 Thread Graham Allan

Oops, by "periods" I do of course mean "intervals"...!

On 10/8/2018 4:57 PM, Graham Allan wrote:
I'm still trying to find a way to reactivate this one pg which is 
incomplete. There are a lot of periods in its history based on a 
combination of a peering storm a couple of weeks ago, with min_size 
being set too low for safety. At this point I think there is no chance 
of bringing back the full set of most recent osds, so I'd like to 
understand the process to roll back to an earlier period no matter how 
long ago.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] fixing another remapped+incomplete EC 4+2 pg

2018-10-08 Thread Graham Allan
I'm still trying to find a way to reactivate this one pg which is 
incomplete. There are a lot of periods in its history based on a 
combination of a peering storm a couple of weeks ago, with min_size 
being set too low for safety. At this point I think there is no chance 
of bringing back the full set of most recent osds, so I'd like to 
understand the process to roll back to an earlier period no matter how 
long ago.


I understood the process is to set 
osd_find_best_info_ignore_history_les=1 for the primary osd, so 
something like:


ceph tell osd.448 injectargs --osd_find_best_info_ignore_history_les=1

then set that osd down to make it re-peer. But whenever I have tried 
this the osd never becomes active again. Possibly I have misunderstood 
or am doing something else wrong...


Output from pg query is here, if it adds any insight...


https://gist.githubusercontent.com/gtallan/e72b4461fb315983ae9a62cbbcd851d4/raw/0d30ceb315dd5567cb05fd0dc3e2e2c4975d8c01/pg70.b1c-query.txt


(Out of curiosity, is there any way to relate the first and last numbers 
in an interval to an actual timestamp?)


Thanks,

Graham

On 10/03/2018 12:18 PM, Graham Allan wrote:
Following on from my previous adventure with recovering pgs in the face 
of failed OSDs, I now have my EC 4+2 pool oeprating with min_size=5 
which is as things should be.


However I have one pg which is stuck in state remapped+incomplete 
because it has only 4 out of 6 osds running, and I have been unable to 
bring the missing two back into service.



PG_AVAILABILITY Reduced data availability: 1 pg inactive, 1 pg incomplete
    pg 70.82d is remapped+incomplete, acting 
[2147483647,2147483647,190,448,61,315] (reducing pool 
.rgw.buckets.ec42 min_size from 5 may help; search ceph.com/docs for 
'incomplete')


I don't think I want to do anything with min_size as that would make all 
other pgs vulnerable to running dangerously undersized (unless there is 
any way to force that state for only a single pg). It seems to me that 
with 4/6 osds available, it should maybe be possible to force ceph to 
select one or two new osds to rebalance this pg to?


ceph pg query gives me (snippet):


    "down_osds_we_would_probe": [
    98,
    233,
    238,
    239
    ],
    "peering_blocked_by": [],
    "peering_blocked_by_detail": [
    {
    "detail": "peering_blocked_by_history_les_bound"
    }
    ]


Of these, osd 98 appears to have a corrupt xfs filesystem

osd 239 was the original osd to hold a shard of this pg but would not 
keep running, exiting with:


/build/ceph-12.2.7/src/osd/ECBackend.cc: 619: FAILED 
assert(pop.data.length() == 
sinfo.aligned_logical_offset_to_chunk_offset( 
after_progress.data_recovered_to - 
op.recovery_progress.data_recovered_to))


osds 233 and 238 were otherwise evacuated (weight 0) osds to which I 
imported the pg shard from osd 239 (using ceph-objectstore-tool). After 
which they crash with the same assert. More specifically they seem to 
crash in the same way each time the pg becomes active and starts to 
backfill, on the same object:


    -9> 2018-10-03 11:30:28.174586 7f94ce9c4700  5 osd.233 pg_epoch: 
704441 pg[70.82ds1( v 704329'703106 (586066'698574,704329'703106] 
local-lis/les=704439/704440 n=102585 ec=21494/21494 lis/c 
704439/588565 les/c/f 704440/588566/0 68066
6/704439/704439) 
[820,761,105,789,562,485]/[2147483647,233,190,448,61,315]p233(1) r=1 
lpr=704439 pi=[21494,704439)/4 rops=1 
bft=105(2),485(5),562(4),761(1),789(3),820(0) crt=704329'703106 lcod 
0'0 mlcod 0'0 active+undersized+remapped+ba
ckfilling] backfill_pos is 
70:b415ca14:::default.630943.7__shadow_Barley_GC_Project%2fBarley_GC_Project%2fRawdata%2fReads%2fCZOA.6150.7.38741.TGCTGG.fastq.gz.2~Vn8g0rMwpVY8eaW83TDzJ2mczLXAl3z.3_24:head 

    -8> 2018-10-03 11:30:28.174887 7f94ce9c4700  1 -- 
10.31.0.1:6854/2210291 --> 10.31.0.1:6854/2210291 -- 
MOSDECSubOpReadReply(70.82ds1 704441/704439 ECSubReadReply(tid=1, 
attrs_read=0)) v2 -- 0x7f9500472280 con 0
    -7> 2018-10-03 11:30:28.174902 7f94db9de700  1 -- 
10.31.0.1:6854/2210291 <== osd.233 10.31.0.1:6854/2210291 0  
MOSDECSubOpReadReply(70.82ds1 704441/704439 ECSubReadReply(tid=1, 
attrs_read=0)) v2  0+0+0 (0 0 0) 0x7f9500472280

 con 0x7f94fb72b000
    -6> 2018-10-03 11:30:28.176267 7f94ead66700  5 -- 
10.31.0.1:6854/2210291 >> 10.31.0.4:6880/2181727 conn(0x7f94ff2a6000 
:-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=946 cs=1 l=0). 
rx osd.61 seq 9 0x7f9500472500 MOSDECSubOpRe

adReply(70.82ds1 704441/704439 ECSubReadReply(tid=1, attrs_read=0)) v2
    -5> 2018-10-03 11:30:28.176281 7f94ead66700  1 -- 
10.31.0.1:6854/2210291 <== osd.61 10.31.0.4:6880/2181727 9  
MOSDECSubOpReadReply(70.82ds1 704441/704439 ECSubReadReply(tid=1, 
attrs_read=0)) v2  786745+0+0 (875698380 0 0) 0x

7f9500472500 con 0x7f94ff2a6000
    -4> 2018-10-03 11:30:28.177723 7f94ead66700  5 

[ceph-users] fixing another remapped+incomplete EC 4+2 pg

2018-10-03 Thread Graham Allan
Following on from my previous adventure with recovering pgs in the face 
of failed OSDs, I now have my EC 4+2 pool oeprating with min_size=5 
which is as things should be.


However I have one pg which is stuck in state remapped+incomplete 
because it has only 4 out of 6 osds running, and I have been unable to 
bring the missing two back into service.



PG_AVAILABILITY Reduced data availability: 1 pg inactive, 1 pg incomplete
pg 70.82d is remapped+incomplete, acting 
[2147483647,2147483647,190,448,61,315] (reducing pool .rgw.buckets.ec42 
min_size from 5 may help; search ceph.com/docs for 'incomplete')


I don't think I want to do anything with min_size as that would make all 
other pgs vulnerable to running dangerously undersized (unless there is 
any way to force that state for only a single pg). It seems to me that 
with 4/6 osds available, it should maybe be possible to force ceph to 
select one or two new osds to rebalance this pg to?


ceph pg query gives me (snippet):


"down_osds_we_would_probe": [
98,
233,
238,
239
],
"peering_blocked_by": [],
"peering_blocked_by_detail": [
{
"detail": "peering_blocked_by_history_les_bound"
}
]


Of these, osd 98 appears to have a corrupt xfs filesystem

osd 239 was the original osd to hold a shard of this pg but would not 
keep running, exiting with:



/build/ceph-12.2.7/src/osd/ECBackend.cc: 619: FAILED assert(pop.data.length() 
== sinfo.aligned_logical_offset_to_chunk_offset( 
after_progress.data_recovered_to - op.recovery_progress.data_recovered_to))


osds 233 and 238 were otherwise evacuated (weight 0) osds to which I 
imported the pg shard from osd 239 (using ceph-objectstore-tool). After 
which they crash with the same assert. More specifically they seem to 
crash in the same way each time the pg becomes active and starts to 
backfill, on the same object:



-9> 2018-10-03 11:30:28.174586 7f94ce9c4700  5 osd.233 pg_epoch: 704441 
pg[70.82ds1( v 704329'703106 (586066'698574,704329'703106] 
local-lis/les=704439/704440 n=102585 ec=21494/21494 lis/c 704439/588565 les/c/f 
704440/588566/0 68066
6/704439/704439) 
[820,761,105,789,562,485]/[2147483647,233,190,448,61,315]p233(1) r=1 lpr=704439 
pi=[21494,704439)/4 rops=1 bft=105(2),485(5),562(4),761(1),789(3),820(0) 
crt=704329'703106 lcod 0'0 mlcod 0'0 active+undersized+remapped+ba
ckfilling] backfill_pos is 
70:b415ca14:::default.630943.7__shadow_Barley_GC_Project%2fBarley_GC_Project%2fRawdata%2fReads%2fCZOA.6150.7.38741.TGCTGG.fastq.gz.2~Vn8g0rMwpVY8eaW83TDzJ2mczLXAl3z.3_24:head
-8> 2018-10-03 11:30:28.174887 7f94ce9c4700  1 -- 10.31.0.1:6854/2210291 
--> 10.31.0.1:6854/2210291 -- MOSDECSubOpReadReply(70.82ds1 704441/704439 
ECSubReadReply(tid=1, attrs_read=0)) v2 -- 0x7f9500472280 con 0
-7> 2018-10-03 11:30:28.174902 7f94db9de700  1 -- 10.31.0.1:6854/2210291 
<== osd.233 10.31.0.1:6854/2210291 0  MOSDECSubOpReadReply(70.82ds1 
704441/704439 ECSubReadReply(tid=1, attrs_read=0)) v2  0+0+0 (0 0 0) 
0x7f9500472280
 con 0x7f94fb72b000
-6> 2018-10-03 11:30:28.176267 7f94ead66700  5 -- 10.31.0.1:6854/2210291 >> 
10.31.0.4:6880/2181727 conn(0x7f94ff2a6000 :-1 
s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=946 cs=1 l=0). rx osd.61 seq 9 
0x7f9500472500 MOSDECSubOpRe
adReply(70.82ds1 704441/704439 ECSubReadReply(tid=1, attrs_read=0)) v2
-5> 2018-10-03 11:30:28.176281 7f94ead66700  1 -- 10.31.0.1:6854/2210291 
<== osd.61 10.31.0.4:6880/2181727 9  MOSDECSubOpReadReply(70.82ds1 
704441/704439 ECSubReadReply(tid=1, attrs_read=0)) v2  786745+0+0 (875698380 0 0) 
0x
7f9500472500 con 0x7f94ff2a6000
-4> 2018-10-03 11:30:28.177723 7f94ead66700  5 -- 10.31.0.1:6854/2210291 >> 
10.31.0.9:6920/13427 conn(0x7f94ff2bc800 :-1 
s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=46152 cs=1 l=0). rx osd.448 seq 8 
0x7f94fe9d5980 MOSDECSubOpR
eadReply(70.82ds1 704441/704439 ECSubReadReply(tid=1, attrs_read=0)) v2
-3> 2018-10-03 11:30:28.177738 7f94ead66700  1 -- 10.31.0.1:6854/2210291 
<== osd.448 10.31.0.9:6920/13427 8  MOSDECSubOpReadReply(70.82ds1 
704441/704439 ECSubReadReply(tid=1, attrs_read=0)) v2  786745+0+0 (2772477454 0 
0) 0x
7f94fe9d5980 con 0x7f94ff2bc800
-2> 2018-10-03 11:30:28.185788 7f94ea565700  5 -- 10.31.0.1:6854/2210291 >> 
10.31.0.7:6868/2012671 conn(0x7f94ff5c3800 :6854 
s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=4193 cs=1 l=0). rx osd.190 seq 10 
0x7f9500472780 MOSDECSu
bOpReadReply(70.82ds1 704441/704439 ECSubReadReply(tid=1, attrs_read=0)) v2
-1> 2018-10-03 11:30:28.185815 7f94ea565700  1 -- 10.31.0.1:6854/2210291 
<== osd.190 10.31.0.7:6868/2012671 10  MOSDECSubOpReadReply(70.82ds1 
704441/704439 ECSubReadReply(tid=1, attrs_read=0)) v2  786745+0+0 (2670842780 0 0)
 0x7f9500472780 con 0x7f94ff5c3800
 0> 2018-10-03 11:30:28.194795 7f94ce9c4700 -1