Re: [ceph-users] lost osd while migrating EC pool to device-class crush rules

2018-09-18 Thread Graham Allan

On 09/17/2018 04:33 PM, Gregory Farnum wrote:
On Mon, Sep 17, 2018 at 8:21 AM Graham Allan wrote:


Looking back through history it seems that I *did* override the
min_size
for this pool, however I didn't reduce it - it used to have min_size 2!
That made no sense to me - I think it must be an artifact of a very
early (hammer?) ec pool creation, but it pre-dates me.

I found the documentation on what min_size should be a bit confusing
which is how I arrived at 4. Fully agree that k+1=5 makes way more
sense.

I don't think I was the only one confused by this though, eg
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-May/026445.html

I suppose the safest thing to do is update min_size->5 right away to
force any size-4 pgs down until they can perform recovery. I can set
force-recovery on these as well...


Mmm, this is embarrassing but that actually doesn't quite work due to 
https://github.com/ceph/ceph/pull/24095, which has been on my task list 
but at the bottom for a while. :( So if your cluster is stable now I'd 
let it clean up and then change the min_size once everything is repaired.


Thanks for your feedback, Greg. Since declaring the dead osd as lost, 
the downed pg became active again, and is successfully serving data. The 
cluster is considerably more stable now; I've set force-backfill or 
force-recovery on any size=4 pgs and can wait for that to complete 
before changing anything else.
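
For anyone following this in the archives, the "declare lost" step above is a
one-liner; this is a sketch rather than a paste of the exact session, with osd
98 being the failed disk discussed further down the thread:

  # tell the cluster the failed osd's data is gone for good, so the
  # down pg can finish peering without it
  ceph osd lost 98 --yes-i-really-mean-it

  # then check that the pg has gone active again
  ceph health detail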


Thanks again,

Graham
--
Graham Allan
Minnesota Supercomputing Institute - g...@umn.edu


Re: [ceph-users] lost osd while migrating EC pool to device-class crush rules

2018-09-17 Thread Gregory Farnum
On Mon, Sep 17, 2018 at 8:21 AM Graham Allan  wrote:

>
>
> On 09/14/2018 02:38 PM, Gregory Farnum wrote:
> > On Thu, Sep 13, 2018 at 3:05 PM, Graham Allan  wrote:
> >>
> >> However I do see transfer errors fetching some files out of radosgw - the
> >> transfer just hangs then aborts. I'd guess this is probably due to one pg
> >> stuck down, due to a lost (failed HDD) osd. I think there is no alternative
> >> but to declare the osd lost, but I wish I understood better the implications
> >> of the "recovery_state" and "past_intervals" output by ceph pg query:
> >> https://pastebin.com/8WrYLwVt
> >
> > What are you curious about here? The past_intervals output lists the
> > OSDs which were involved in the PG since it was last clean, then each
> > acting set and the intervals it was active for.
>
> That's pretty much what I'm looking for - confirmation that the pg can roll
> back to an earlier interval if there were no writes, and that the current
> osd has to be declared lost.
>
> >> I find it disturbing/odd that the acting set of osds lists only 3/6
> >> available; this implies that without getting one of these back it would
> >> be impossible to recover the data (from 4+2 EC). However the dead osd 98
> >> only appears in the most recent (?) interval - presumably during the
> >> flapping period, during which time client writes were unlikely (radosgw
> >> disabled).
> >>
> >> So if 98 were marked lost, would it roll back to the prior interval? I am
> >> not certain how to interpret this information!
> >
> > Yes, that’s what should happen if it’s all as you outline here.
> >
> > It *is* quite curious that the PG apparently went active with only 4
> > members in a 4+2 system — it's supposed to require at least k+1 (here,
> > 5) by default. Did you override the min_size or something?
> > -Greg
>
> Looking back through history it seems that I *did* override the min_size
> for this pool, however I didn't reduce it - it used to have min_size 2!
> That made no sense to me - I think it must be an artifact of a very
> early (hammer?) ec pool creation, but it pre-dates me.
>
> I found the documentation on what min_size should be a bit confusing
> which is how I arrived at 4. Fully agree that k+1=5 makes way more sense.
>
> I don't think I was the only one confused by this though, eg
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-May/026445.html
>
> I suppose the safest thing to do is update min_size->5 right away to
> force any size-4 pgs down until they can perform recovery. I can set
> force-recovery on these as well...
>

Mmm, this is embarrassing but that actually doesn't quite work due to
https://github.com/ceph/ceph/pull/24095, which has been on my task list but
at the bottom for a while. :( So if your cluster is stable now I'd let it
clean up and then change the min_size once everything is repaired.
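
For reference, the eventual change is just a pool setting; a minimal sketch,
assuming the data pool carries the same name as its old crush rule
(.rgw.buckets.ec42):

  ceph osd pool get .rgw.buckets.ec42 min_size     # currently 4
  ceph osd pool set .rgw.buckets.ec42 min_size 5   # k+1 for the 4+2 profile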


>
> Is there any setting which can permit these pgs to fulfil reads while
> refusing writes when active size=k?
>

No, that's unfortunately infeasible.
-Greg


Re: [ceph-users] lost osd while migrating EC pool to device-class crush rules

2018-09-17 Thread Graham Allan



On 09/14/2018 02:38 PM, Gregory Farnum wrote:

On Thu, Sep 13, 2018 at 3:05 PM, Graham Allan  wrote:


However I do see transfer errors fetching some files out of radosgw - the
transfer just hangs then aborts. I'd guess this is probably due to one pg stuck
down, due to a lost (failed HDD) osd. I think there is no alternative but to
declare the osd lost, but I wish I understood better the implications of the
"recovery_state" and "past_intervals" output by ceph pg query:
https://pastebin.com/8WrYLwVt


What are you curious about here? The past_intervals output lists the
OSDs which were involved in the PG since it was last clean, then each
acting set and the intervals it was active for.


That's pretty much what I'm looking for - confirmation that the pg can roll back 
to an earlier interval if there were no writes, and that the current osd has to 
be declared lost.



I find it disturbing/odd that the acting set of osds lists only 3/6
available; this implies that without getting one of these back it would be
impossible to recover the data (from 4+2 EC). However the dead osd 98 only
appears in the most recent (?) interval - presumably during the flapping
period, during which time client writes were unlikely (radosgw disabled).

So if 98 were marked lost, would it roll back to the prior interval? I am not
certain how to interpret this information!


Yes, that’s what should happen if it’s all as you outline here.

It *is* quite curious that the PG apparently went active with only 4
members in a 4+2 system — it's supposed to require at least k+1 (here,
5) by default. Did you override the min_size or something?
-Greg


Looking back through history it seems that I *did* override the min_size 
for this pool, however I didn't reduce it - it used to have min_size 2! 
That made no sense to me - I think it must be an artifact of a very 
early (hammer?) ec pool creation, but it pre-dates me.


I found the documentation on what min_size should be a bit confusing 
which is how I arrived at 4. Fully agree that k+1=5 makes way more sense.


I don't think I was the only one confused by this though, eg
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-May/026445.html

I suppose the safest thing to do is update min_size->5 right away to 
force any size-4 pgs down until they can perform recovery. I can set 
force-recovery on these as well...
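
The force-recovery part of that looks roughly like the following; a sketch
assuming the pool name matches its old crush rule, with the pg ids substituted
from the listing:

  # list the pgs currently running short-handed
  ceph pg ls-by-pool .rgw.buckets.ec42 undersized

  # push them to the front of the recovery/backfill queues
  ceph pg force-recovery <pgid> [<pgid> ...]
  ceph pg force-backfill <pgid> [<pgid> ...]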


Is there any setting which can permit these pgs to fulfil reads while 
refusing writes when active size=k?



--
Graham Allan
Minnesota Supercomputing Institute - g...@umn.edu


Re: [ceph-users] lost osd while migrating EC pool to device-class crush rules

2018-09-14 Thread Gregory Farnum
On Thu, Sep 13, 2018 at 3:05 PM, Graham Allan  wrote:
> I'm now following up to my earlier message regarding data migration from old
> to new hardware in our ceph cluster. As part of this we wanted to move to
> device-class-based crush rules. For the replicated pools the directions for
> this were straightforward; for our EC pool, it wasn't so clear, but I
> proceeded via the outline below: defining a new EC profile and associated
> crush rule, then finally, last Wednesday (5 Sept) updating the crush rule
> for the pool.
>
> Data migration started well for the first 36 or so hours, then the storage
> backend became very unstable with OSDs dropping out and failing to peer...
> capped off by the mons running out of disk space. After correcting the mon
> disk issue, stability seemed to be restored by dropping down max backfills
> and a few other parameters, eg
>
> osd max backfills = 4
> osd recovery max active = 1
> osd max recovery threads = 1
>
> I also had to stop then restart all OSDs to stop the peering storm, but the
> data migration has been proceeding fine since then.
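
Those throttles can also be adjusted at runtime rather than via ceph.conf and
restarts; an illustrative sketch using the usual luminous option names:

  ceph tell osd.* injectargs '--osd-max-backfills 4 --osd-recovery-max-active 1'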
>
> However I do see transfer errors fetching some files out of radosgw - the
> transfer just hangs then aborts. I'd guess this is probably due to one pg stuck
> down, due to a lost (failed HDD) osd. I think there is no alternative but to
> declare the osd lost, but I wish I understood better the implications of the
> "recovery_state" and "past_intervals" output by ceph pg query:
> https://pastebin.com/8WrYLwVt

What are you curious about here? The past_intervals output lists the
OSDs which were involved in the PG since it was last clean, then each
acting set and the intervals it was active for.
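
For anyone digging into a similar case, the relevant sections can be pulled out
of the (very long) query output along these lines, substituting the pg id that
ceph health reports as down:

  ceph pg <pgid> query > pg.json
  grep -A20 '"past_intervals"' pg.json   # acting sets since the pg was last clean
  grep -A20 '"recovery_state"' pg.json   # the peering/recovery state machine output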

>
> I find it disturbing/odd that the acting set of osds lists only 3/6
> available; this implies that without getting one of these back it would be
> impossible to recover the data (from 4+2 EC). However the dead osd 98 only
> appears in the most recent (?) interval - presumably during the flapping
> period, during which time client writes were unlikely (radosgw disabled).
>
> So if 98 were marked lost, would it roll back to the prior interval? I am not
> certain how to interpret this information!

Yes, that’s what should happen if it’s all as you outline here.

It *is* quite curious that the PG apparently went active with only 4
members in a 4+2 system — it's supposed to require at least k+1 (here,
5) by default. Did you override the min_size or something?
-Greg


>
> Running luminous 12.2.7 if it makes a difference.
>
> Thanks as always for pointers,
>
> Graham
>
> On 07/18/2018 02:35 PM, Graham Allan wrote:
>>
>> Like many, we have a typical double root crush map, for hdd vs ssd-based
>> pools. We've been running luminous for some time, so in preparation for a
>> migration to new storage hardware, I wanted to migrate our pools to use the
>> new device-class based rules; this way I shouldn't need to perpetuate the
>> double hdd/ssd crush map for new hardware...
>>
>> I understand how to migrate our replicated pools, by creating new
>> replicated crush rules, and migrating them one at a time, but I'm confused
>> on how to do this for erasure pools.
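
For completeness, the replicated-pool side of this (taken as understood in the
thread) is roughly the following; the rule name here is arbitrary and the pool
name a placeholder:

  ceph osd crush rule create-replicated replicated_hdd default host hdd
  ceph osd pool set <poolname> crush_rule replicated_hdd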
>>
>> I can create a new class-aware EC profile something like:
>>
>>> ceph osd erasure-code-profile set ecprofile42_hdd k=4 m=2
>>> crush-device-class=hdd crush-failure-domain=host
>>
>>
>> then a new crush rule from this:
>>
>>> ceph osd crush rule create-erasure ec42_hdd ecprofile42_hdd
>>
>>
>> So mostly I want to confirm that it is safe to change the crush rule for
>> the EC pool. It seems to make sense, but then, as I understand it, you can't
>> change the erasure code profile for a pool after creation, yet this seems to
>> do so implicitly...
>>
>> old rule:
>>>
>>> rule .rgw.buckets.ec42 {
>>> id 17
>>> type erasure
>>> min_size 3
>>> max_size 20
>>> step set_chooseleaf_tries 5
>>> step take platter
>>> step chooseleaf indep 0 type host
>>> step emit
>>> }
>>
>>
>> old ec profile:
>>>
>>> # ceph osd erasure-code-profile get ecprofile42
>>> crush-failure-domain=host
>>> directory=/usr/lib/x86_64-linux-gnu/ceph/erasure-code
>>> k=4
>>> m=2
>>> plugin=jerasure
>>> technique=reed_sol_van
>>
>>
>> new rule:
>>>
>>> rule ec42_hdd {
>>> id 7
>>> type erasure
>>> min_size 3
>>> max_size 6
>>> step set_chooseleaf_tries 5
>>> step set_choose_tries 100
>>> step take default class hdd
>>> step chooseleaf indep 0 type host
>>> step emit
>>> }
>>
>>
>> new ec profile:
>>>
>>> # ceph osd erasure-code-profile get ecprofile42_hdd
>>> crush-device-class=hdd
>>> crush-failure-domain=host
>>> crush-root=default
>>> jerasure-per-chunk-alignment=false
>>> k=4
>>> m=2
>>> plugin=jerasure
>>> technique=reed_sol_van
>>> w=8
>>
>>
>> These are both ec42 but I'm not sure why the old rule has "max size 20"
>> (perhaps because it was generated a long time ago under hammer?).
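
For reference, the switch being discussed is a pool-level crush rule change
only; a sketch, assuming the pool shares the name of its old rule:

  ceph osd pool set .rgw.buckets.ec42 crush_rule ec42_hdd
  ceph osd pool get .rgw.buckets.ec42 crush_rule   # confirm it now points at ec42_hdd

The erasure code profile recorded on the pool is not touched by this; only the
placement rule changes.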
>>
>> Thanks