Re: [ceph-users] fast_read in EC pools

Oliver Freyermuth Mon, 26 Feb 2018 15:01:12 -0800

On 26.02.2018 at 23:48, Gregory Farnum wrote:
> 
> 
> On Mon, Feb 26, 2018 at 2:30 PM Oliver Freyermuth <[email protected]> wrote:
> 
>     On 26.02.2018 at 23:15, Gregory Farnum wrote:
>     >
>     >
>     > On Mon, Feb 26, 2018 at 11:48 AM Oliver Freyermuth <[email protected]> wrote:
>     >
>     >     >     >     The EC pool I am considering is k=4 m=2 with failure domain host, on 6 hosts.
>     >     >     >     So necessarily, there is one shard on each host. If one host goes down for a prolonged time,
>     >     >     >     there's no "logical" advantage in redistributing things, since whatever you do, with 5 hosts all PGs will stay in a degraded state anyway.
>     >     >     >
>     >     >     >     However, I noticed Ceph is remapping all PGs and actively moving data. I presume this is done for two reasons:
>     >     >     >     - The remapping is needed since the primary OSD might be the one which went down. But for remapping (I guess) there's no need to actually move data, or is there?
>     >     >     >     - The data movement is done to have the "k" shards available.
>     >     >     >     If it's really the case that "all shards are equal", then data movement should not occur - or is this a bug / bad feature?
>     >     >     >
>     >     >     >
>     >     >     >     > If you lose one OSD out of a host, Ceph is going to try and re-replicate the data onto the other OSDs in that host. Your PG size and the CRUSH rule instruct it that the PG needs 6 different OSDs, and that those OSDs need to be placed on different hosts.
>     >     >     >
>     >     >     >     > You're right that this gets very funny if your PG size is equal to the number of hosts. We generally discourage people from running configurations like that.
>     >     >
>     >     >     Yes. k=4 with m=2 on 6 hosts (i.e. the possibility to lose 2 hosts) would be our starting point, since we may add more hosts later (not too soon, but it's not excluded that more may come in a year or so),
>     >     >     and migrating large EC pools to different settings still seems a bit messy.
>     >     >     We can't really afford to reduce available storage significantly more in the current setup, and would like to be able to lose one host (for example for an OS upgrade),
>     >     >     and then still lose a few disks in case they fail with bad timing.
>     >     >
>     >     >     >
>     >     >     > Or if you mean that you are losing a host, and the data is shuffling around on the remaining hosts...hrm, that'd be weird. (Perhaps a result of EC pools' "indep" rather than "firstn" crush rules?)
>     >     >
>     >     >     They are indep, which I think is the default (no manual editing done). I thought the main goal of indep was exactly to reduce data movement.
>     >     >     Indeed, it's very funny that data is moved; it certainly does not help to increase redundancy ;-).
>     >     >
>     >     <snip>
>     >     >
>     >     > Can you also share the output of "ceph osd crush dump"?
>     >
>     >     Attached.
>     >
>     >
>     > Yep, that all looks simple enough.
>     >
>     > Do you have any "ceph -s" or other records from when this was occurring? Is it actually deleting or migrating any of the existing shards, or is it just that the shards which were previously on the out'ed OSDs are now getting copied onto the remaining ones?
>     >
>     > I think I finally understand what's happening here but would like to be sure. :)
>     > -Greg
>     >
>     > (In short: certain straws were previously mapping onto osd.[outed], but now they map onto the remaining OSDs. Because everything's independent, the actual CRUSH mapping for any shard other than the last is now going to map onto a remaining OSD, which would displace the shard it already holds. But the previously-present shard is going to remain "remapped" there because it can't map successfully. So if you lose osd.5, you'll go from a CRUSH mapping like [1,3,5,0,2,4] to [1,3,4,0,2,UNMAPPED], but in reality shards 2 and 5 will both be on OSD 4.)
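
The osd.5 example above can be replayed with a toy Python sketch. This is not the real CRUSH algorithm: `remap_indep`, `shards_per_osd` and the `retry_choice` argument are hypothetical names invented for illustration, and the retry result (OSD 4) is simply taken from the example rather than computed. The value 2147483647 is the placeholder Ceph prints for a slot with no OSD mapped.

```python
# Toy illustration only: this does NOT implement CRUSH.
CRUSH_ITEM_NONE = 2147483647  # placeholder printed for an unmapped slot


def remap_indep(old_mapping, lost_osd, retry_choice):
    """Recompute an indep-style mapping after `lost_osd` is marked out.

    retry_choice is a hypothetical stand-in for CRUSH's retry: it names
    the OSD picked for the slot that lost its OSD.
    """
    new_mapping = [retry_choice if osd == lost_osd else osd
                   for osd in old_mapping]
    # A later slot that wants an OSD already taken cannot map and is
    # left as CRUSH_ITEM_NONE.
    seen = set()
    for slot, osd in enumerate(new_mapping):
        if osd in seen:
            new_mapping[slot] = CRUSH_ITEM_NONE
        else:
            seen.add(osd)
    return new_mapping


def shards_per_osd(old_mapping, new_mapping):
    """Where shards physically live: an unmapped slot stays on its old OSD."""
    placement = {}
    for slot, osd in enumerate(new_mapping):
        holder = old_mapping[slot] if osd == CRUSH_ITEM_NONE else osd
        placement.setdefault(holder, []).append(slot)
    return placement


old = [1, 3, 5, 0, 2, 4]
new = remap_indep(old, lost_osd=5, retry_choice=4)
placement = shards_per_osd(old, new)
print(new)           # [1, 3, 4, 0, 2, 2147483647]
print(placement[4])  # [2, 5]
```

Under these assumptions OSD 4 ends up holding shards 2 and 5, which is the doubling-up described in the parenthetical.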
> 
>     Interesting! This would also mean that space usage on the remaining active OSDs would increase by 1/6 in our setup, which is significant.
>     So that's another good reason to use mon_osd_down_out_subtree_limit=host or to just run "ceph osd set noout" when actively reinstalling a host.
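
One way to arrive at the 1/6 figure is the following back-of-the-envelope sketch. It assumes 6 equally filled hosts, shards spread evenly over the 6 slots of a PG, and (following the description that every shard "other than the last" gets remapped) that 5/6 of the lost host's shards are re-created evenly on the 5 survivors. The helper name `relative_growth` is hypothetical.

```python
from fractions import Fraction


def relative_growth(hosts, remapped_fraction):
    """Relative space growth on each surviving host after one host is out."""
    per_host = Fraction(1, hosts)         # share of all shards on each host
    moved = per_host * remapped_fraction  # lost host's shards that are re-created
    gain = moved / (hosts - 1)            # spread evenly over the survivors
    return gain / per_host


# Assumption: only slots 0..4 are remapped, the last slot stays unmapped.
growth_indep = relative_growth(6, Fraction(5, 6))
# For comparison: every shard of the lost host is re-created elsewhere.
growth_full = relative_growth(6, Fraction(1))
print(growth_indep, growth_full)  # 1/6 1/5
```

If all of the lost host's shards were rebuilt instead, the same calculation would give 1/5.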
> 
>     I reproduced this just now. Here's what I see (ignore the inconsistent PG; that's unrelated and likely a consequence of previous OSD OOM issues):
>     # ceph -s
>       cluster:
>         id:     69b1fbe5-f084-4410-a99a-ab57417e7846
>         health: HEALTH_ERR
>                 41569430/513248666 objects misplaced (8.099%)
>                 1 scrub errors
>                 Possible data damage: 1 pg inconsistent
>                 Degraded data redundancy: 105575103/513248666 objects degraded (20.570%), 2176 pgs degraded, 985 pgs undersized
> 
>       services:
>         mon: 3 daemons, quorum mon003,mon001,mon002
>         mgr: mon002(active), standbys: mon001, mon003
>         mds: cephfs_baf-1/1/1 up  {0=mon002=up:active}, 1 up:standby-replay, 1 up:standby
>         osd: 196 osds: 164 up, 164 in; 1166 remapped pgs
> 
>       data:
>         pools:   2 pools, 2176 pgs
>         objects: 89370k objects, 4488 GB
>         usage:   29546 GB used, 555 TB / 584 TB avail
>         pgs:     105575103/513248666 objects degraded (20.570%)
>                  41569430/513248666 objects misplaced (8.099%)
>                  1166 active+undersized+degraded+remapped+backfilling
>                  1009 active+undersized+degraded
>                  1    active+undersized+degraded+inconsistent
> 
>       io:
>         client:   6784 kB/s rd, 6820 kB/s wr, 804 op/s rd, 1174 op/s wr
>         recovery: 79333 kB/s, 27 keys/s, 1080 objects/s
> 
>     In ceph health detail, I see:
>         pg 2.7cd is active+undersized+degraded+remapped+backfilling, acting [91,63,33,163,2147483647,103]
>         pg 2.7ce is stuck undersized for 114.063431, current state active+undersized+degraded+remapped+backfilling, last acting [31,121,157,2147483647,61,87]
>         pg 2.7cf is stuck undersized for 110.842287, current state active+undersized+degraded+remapped+backfilling, last acting [163,36,2147483647,21,124,69]
>         pg 2.7d0 is stuck undersized for 118.876276, current state active+undersized+degraded+remapped+backfilling, last acting [140,91,66,22,2147483647,112]
>         pg 2.7d1 is stuck undersized for 388.377010, current state active+undersized+degraded, last acting [62,110,2147483647,31,141,81]
>         pg 2.7d2 is stuck undersized for 111.265718, current state active+undersized+degraded+remapped+backfilling, last acting [54,125,2147483647,157,88,21]
>         pg 2.7d3 is stuck undersized for 105.885607, current state active+undersized+degraded+remapped+backfilling, last acting [20,117,96,2147483647,144,54]
>         pg 2.7d4 is stuck undersized for 112.693680, current state active+undersized+degraded+remapped+backfilling, last acting [105,145,71,60,2147483647,13]
>         pg 2.7d5 is stuck undersized for 388.337919, current state active+undersized+degraded, last acting [142,90,19,60,2147483647,127]
>     [...]
>     Whereas when the host's OSDs were only down, but still in, I saw:
>         pg 2.7cd is active+undersized+degraded, acting [91,63,33,163,2147483647,103]
>         pg 2.7ce is stuck undersized for 145.507311, current state active+undersized+degraded, last acting [31,121,157,2147483647,61,87]
>         pg 2.7cf is stuck undersized for 143.293067, current state active+undersized+degraded, last acting [163,36,2147483647,21,124,69]
>         pg 2.7d0 is stuck undersized for 145.461503, current state active+undersized+degraded, last acting [140,91,66,22,2147483647,112]
>         pg 2.7d1 is stuck undersized for 145.496089, current state active+undersized+degraded, last acting [62,110,2147483647,31,141,81]
>         pg 2.7d2 is stuck undersized for 145.513296, current state active+undersized+degraded, last acting [54,125,2147483647,157,88,21]
>         pg 2.7d3 is stuck undersized for 145.503361, current state active+undersized+degraded, last acting [20,117,96,2147483647,144,54]
>         pg 2.7d4 is stuck undersized for 145.484259, current state active+undersized+degraded, last acting [105,145,71,60,2147483647,13]
>         pg 2.7d5 is stuck undersized for 145.456998, current state active+undersized+degraded, last acting [142,90,19,60,2147483647,127]
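
For reading listings like these, a small Python sketch can pull out which slot of each acting set has no OSD. It assumes one PG per line in the "ceph health detail" format quoted here; the regular expression and the helper name `missing_shards` are illustrative choices, not part of any Ceph tool. 2147483647 (0x7fffffff) is the value Ceph prints for a slot with no OSD.

```python
import re

CRUSH_ITEM_NONE = 2147483647  # 0x7fffffff, printed where no OSD is mapped

# Matches both "acting [...]" and "last acting [...]" lines.
LINE_RE = re.compile(r"pg (\S+) is .*acting \[([\d,]+)\]")


def missing_shards(health_detail):
    """Map each PG id to the slot indices whose OSD is missing."""
    result = {}
    for line in health_detail.splitlines():
        match = LINE_RE.search(line)
        if not match:
            continue
        pgid = match.group(1)
        acting = [int(x) for x in match.group(2).split(",")]
        result[pgid] = [slot for slot, osd in enumerate(acting)
                        if osd == CRUSH_ITEM_NONE]
    return result


# Three lines copied from the listing with the host's OSDs down and out.
sample = """\
pg 2.7cd is active+undersized+degraded+remapped+backfilling, acting [91,63,33,163,2147483647,103]
pg 2.7ce is stuck undersized for 114.063431, current state active+undersized+degraded+remapped+backfilling, last acting [31,121,157,2147483647,61,87]
pg 2.7d5 is stuck undersized for 388.337919, current state active+undersized+degraded, last acting [142,90,19,60,2147483647,127]
"""
print(missing_shards(sample))
```

For the three sample lines this reports slot 4 for pg 2.7cd, slot 3 for pg 2.7ce and slot 4 for pg 2.7d5; it only reads the acting set and says nothing about where backfill is heading.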
> 
>     Does this match expectations?
> 
> 
> Can you get the output of e.g. "ceph pg 2.7cd query"? I want to make sure the backfilling versus acting sets and things are correct.


(resent with attachments compressed to make listserver happy)

You'll find attached:
query_allwell)      Output of "ceph pg 2.7cd query" when all OSDs are up and everything is healthy.
query_one_host_out) Output of "ceph pg 2.7cd query" when OSDs 164-195 (one host) are down and out.

Cheers,
        Oliver

query_one_host_out.gz
Description: application/gzip

query_allwell.gz
Description: application/gzip

