Re: [ceph-users] fast_read in EC pools

Gregory Farnum Mon, 26 Feb 2018 14:49:10 -0800

On Mon, Feb 26, 2018 at 2:30 PM Oliver Freyermuth <
[email protected]> wrote:


> On 26.02.2018 at 23:15, Gregory Farnum wrote:
> >
> >
> > On Mon, Feb 26, 2018 at 11:48 AM Oliver Freyermuth <[email protected]>
> wrote:
> >
> >     >     >     The EC pool I am considering is k=4 m=2 with failure
> domain host, on 6 hosts.
> >     >     >     So necessarily, there is one shard for each host. If one
> host goes down for a prolonged time,
> >     >     >     there's no "logical" advantage of redistributing things
> - since whatever you do, with 5 hosts, all PGs will stay in degraded state
> anyways.
> >     >     >
> >     >     >     However, I noticed Ceph is remapping all PGs, and
> actively moving data. I presume now this is done for two reasons:
> >     >     >     - The remapping is needed since the primary OSD might be
> the one which went down. But for remapping (I guess) there's no need to
> actually move data,
> >     >     >       or is there?
> >     >     >     - The data movement is done to have the "k" shards
> available.
> >     >     >     If it's really the case that "all shards are equal",
> then data movement should not occur - or is this a bug / bad feature?
> >     >     >
> >     >     >
> >     >     > If you lose one OSD out of a host, Ceph is going to try and
> re-replicate the data onto the other OSDs in that host. Your PG size and
> the CRUSH rule instructs it that the PG needs 6 different OSDs, and those
> OSDs need to be placed on different hosts.
> >     >     >
> >     >     > You're right that gets very funny if your PG size is equal
> to the number of hosts. We generally discourage people from running
> configurations like that.
> >     >
> >     >     Yes. k=4 with m=2 with 6 hosts (i.e. possibility to lose 2
> hosts) would be our starting point - since we may add more hosts later (not
> too soon-ish, but it's not excluded more may come in a year or so),
> >     >     and migrating large EC pools to different settings still seems
> a bit messy.
> >     >     We can't really afford to reduce available storage
> significantly more in the current setup, and would like to have the
> possibility to lose one host (for example for an OS upgrade),
> >     >     and then still lose a few disks in case they fail with bad
> timing.
> >     >
> >     >     >
> >     >     > Or if you mean that you are losing a host, and the data is
> shuffling around on the remaining hosts...hrm, that'd be weird. (Perhaps a
> result of EC pools' "indep" rather than "firstn" crush rules?)
> >     >
> >     >     They are indep, which I think is the default (no manual
> editing done). I thought the main goal of indep was exactly to reduce data
> movement.
> >     >     Indeed, it's very funny that data is moved, it certainly does
> not help to increase redundancy ;-).
> >     >
> >     <snip>
> >     >
> >     > Can you also share the output of "ceph osd crush dump"?
> >
> >     Attached.
> >
> >
> > Yep, that all looks simple enough.
> >
> > Do you have any "ceph -s" or other records from when this was occurring?
> Is it actually deleting or migrating any of the existing shards, or is it
> just that the shards which were previously on the out'ed OSDs are now
> getting copied onto the remaining ones?
> >
> > I think I finally understand what's happening here but would like to be
> sure. :)
> > -Greg
> >
> > (In short: certain straws were previously mapping onto osd.[outed], but
> now they map onto the remaining OSDs. Because everything's independent, the
> actual CRUSH mapping for any shard other than the last is now going to map
> onto a remaining OSD, which would displace the shard it already holds. But
> the previously-present shard is going to remain "remapped" there because it
> can't map successfully. So if you lose osd.5, you'll go from a CRUSH
> mapping like [1,3,5,0,2,4] to [1,3,4,0,2,UNMAPPED], but in reality shards 2
> and 5 will both be on OSD 4.)
>
> Interesting! This would also mean that space usage on the remaining-active
> OSDs would increase by 1/6 in our setup, which is significant.
> So that's another good reason to use mon_osd_down_out_subtree_limit=host
> or to just set "ceph osd set noout" when actively reinstalling a host.
>
> I reproduced it just now. Here's what I see (ignore the inconsistent PG,
> that's unrelated and likely a consequence of previous OSD OOM issues):
> # ceph -s
>   cluster:
>     id:     69b1fbe5-f084-4410-a99a-ab57417e7846
>     health: HEALTH_ERR
>             41569430/513248666 objects misplaced (8.099%)
>             1 scrub errors
>             Possible data damage: 1 pg inconsistent
>             Degraded data redundancy: 105575103/513248666 objects degraded
> (20.570%), 2176 pgs degraded, 985 pgs undersized
>
>   services:
>     mon: 3 daemons, quorum mon003,mon001,mon002
>     mgr: mon002(active), standbys: mon001, mon003
>     mds: cephfs_baf-1/1/1 up  {0=mon002=up:active}, 1 up:standby-replay, 1
> up:standby
>     osd: 196 osds: 164 up, 164 in; 1166 remapped pgs
>
>   data:
>     pools:   2 pools, 2176 pgs
>     objects: 89370k objects, 4488 GB
>     usage:   29546 GB used, 555 TB / 584 TB avail
>     pgs:     105575103/513248666 objects degraded (20.570%)
>              41569430/513248666 objects misplaced (8.099%)
>              1166 active+undersized+degraded+remapped+backfilling
>              1009 active+undersized+degraded
>              1    active+undersized+degraded+inconsistent
>
>   io:
>     client:   6784 kB/s rd, 6820 kB/s wr, 804 op/s rd, 1174 op/s wr
>     recovery: 79333 kB/s, 27 keys/s, 1080 objects/s
>
> In ceph health detail, I see:
>     pg 2.7cd is active+undersized+degraded+remapped+backfilling, acting
> [91,63,33,163,2147483647,103]
>     pg 2.7ce is stuck undersized for 114.063431, current state
> active+undersized+degraded+remapped+backfilling, last acting
> [31,121,157,2147483647,61,87]
>     pg 2.7cf is stuck undersized for 110.842287, current state
> active+undersized+degraded+remapped+backfilling, last acting
> [163,36,2147483647,21,124,69]
>     pg 2.7d0 is stuck undersized for 118.876276, current state
> active+undersized+degraded+remapped+backfilling, last acting
> [140,91,66,22,2147483647,112]
>     pg 2.7d1 is stuck undersized for 388.377010, current state
> active+undersized+degraded, last acting [62,110,2147483647,31,141,81]
>     pg 2.7d2 is stuck undersized for 111.265718, current state
> active+undersized+degraded+remapped+backfilling, last acting
> [54,125,2147483647,157,88,21]
>     pg 2.7d3 is stuck undersized for 105.885607, current state
> active+undersized+degraded+remapped+backfilling, last acting
> [20,117,96,2147483647,144,54]
>     pg 2.7d4 is stuck undersized for 112.693680, current state
> active+undersized+degraded+remapped+backfilling, last acting
> [105,145,71,60,2147483647,13]
>     pg 2.7d5 is stuck undersized for 388.337919, current state
> active+undersized+degraded, last acting [142,90,19,60,2147483647,127]
> [...]
> Whereas when the host's OSDs were only down, but still in, I saw:
>     pg 2.7cd is active+undersized+degraded, acting
> [91,63,33,163,2147483647,103]
>     pg 2.7ce is stuck undersized for 145.507311, current state
> active+undersized+degraded, last acting [31,121,157,2147483647,61,87]
>     pg 2.7cf is stuck undersized for 143.293067, current state
> active+undersized+degraded, last acting [163,36,2147483647,21,124,69]
>     pg 2.7d0 is stuck undersized for 145.461503, current state
> active+undersized+degraded, last acting [140,91,66,22,2147483647,112]
>     pg 2.7d1 is stuck undersized for 145.496089, current state
> active+undersized+degraded, last acting [62,110,2147483647,31,141,81]
>     pg 2.7d2 is stuck undersized for 145.513296, current state
> active+undersized+degraded, last acting [54,125,2147483647,157,88,21]
>     pg 2.7d3 is stuck undersized for 145.503361, current state
> active+undersized+degraded, last acting [20,117,96,2147483647,144,54]
>     pg 2.7d4 is stuck undersized for 145.484259, current state
> active+undersized+degraded, last acting [105,145,71,60,2147483647,13]
>     pg 2.7d5 is stuck undersized for 145.456998, current state
> active+undersized+degraded, last acting [142,90,19,60,2147483647,127]
>
> Does this match expectations?
>

Can you get the output of, e.g., "ceph pg 2.7cd query"? I want to make sure
the backfill targets and the acting sets are correct.
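
For reading the sets quoted above, here is a rough Python sketch (not
anything Ceph ships; the helper names missing_shards and changed_slots are
made up for illustration). It assumes that 2147483647 is the placeholder
Ceph prints for an EC shard slot with no OSD assigned (CRUSH_ITEM_NONE,
0x7fffffff), and it reuses only numbers that already appear in this thread:
the acting set of pg 2.7cd and the [1,3,5,0,2,4] example mapping.

```python
# Placeholder value printed for a shard slot that has no OSD assigned.
CRUSH_ITEM_NONE = 0x7FFFFFFF  # 2147483647


def missing_shards(osds):
    """Return the shard indexes whose slot holds no OSD."""
    return [i for i, osd in enumerate(osds) if osd == CRUSH_ITEM_NONE]


def changed_slots(before, after):
    """Return (shard, old_osd, new_osd) for every slot that differs."""
    return [(i, b, a) for i, (b, a) in enumerate(zip(before, after)) if b != a]


# Acting set of pg 2.7cd as quoted in the thread.
acting = [91, 63, 33, 163, 2147483647, 103]
print(missing_shards(acting))  # [4]

# The example CRUSH mappings from the explanation quoted above (osd.5 lost).
before = [1, 3, 5, 0, 2, 4]
after = [1, 3, 4, 0, 2, CRUSH_ITEM_NONE]
print(changed_slots(before, after))  # [(2, 5, 4), (5, 4, 2147483647)]
```

In the second call, slot 2 now maps to OSD 4 and slot 5 has no mapping,
which is the case where OSD 4 would end up holding both shard 2 and shard 5.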

_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
