On Mon, Feb 26, 2018 at 11:33 AM Oliver Freyermuth <[email protected]> wrote:
> On 26.02.2018 at 20:23, Gregory Farnum wrote:
> >
> > On Mon, Feb 26, 2018 at 11:06 AM Oliver Freyermuth <[email protected]> wrote:
> > >
> > > On 26.02.2018 at 19:45, Gregory Farnum wrote:
> > > >
> > > > On Mon, Feb 26, 2018 at 10:35 AM Oliver Freyermuth <[email protected]> wrote:
> > > > >
> > > > > On 26.02.2018 at 19:24, Gregory Farnum wrote:
> > > > > >
> > > > > > I don't actually know this option, but based on your results it's clear that "fast read" is telling the OSD it should issue reads to all k+m OSDs storing data and then reconstruct the data from the first k to respond. Without the fast read it simply asks the regular k data nodes to read it back straight and sends the reply back. This is a straight trade-off of more bandwidth for lower long-tail latencies.
> > > > > > -Greg
> > > > >
> > > > > Many thanks, this certainly explains it!
> > > > > Apparently I misunderstood how "normal" read works - I thought that in any case, all shards would be requested, and the primary OSD would check the EC is still fine.
> > > >
> > > > Nope, EC PGs can self-validate (they checksum everything), and so extra shards are requested only if one of the OSDs has an error.
> > > >
> > > > > However, with the explanation that indeed only the actual "k" shards are read in the "normal" case, it's fully clear to me that "fast_read" will be slower for us, since we are limited by network bandwidth.
> > > > >
> > > > > On a side note, activating fast_read also appears to increase CPU load a bit, which is probably due to the EC calculations that need to be performed if the "wrong" shards arrive at the primary OSD first.
> > > > >
> > > > > I believe this also explains why an EC pool actually does remapping in a k=4 m=2 pool with failure domain host if one of 6 hosts goes down: namely, to have the "k" shards available on the "up" OSDs.
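For illustration (my arithmetic, not anything from Ceph itself): a back-of-the-envelope sketch of the bandwidth trade-off described above, assuming each shard holds object_size/k bytes, a normal read fetches only the k data shards, and fast_read fetches all k+m shards and keeps the first k replies. The object size is a made-up example value.

```python
# Rough model of EC read traffic, under the assumptions stated above:
#   - each shard holds object_size / k bytes
#   - a normal read fetches only the k data shards
#   - fast_read fetches all k + m shards and uses the first k replies

def read_traffic(object_size, k, m, fast_read=False):
    """Bytes requested from OSDs to serve one object read."""
    shard_size = object_size / k
    shards_fetched = k + m if fast_read else k
    return shard_size * shards_fetched

k, m = 4, 2                    # pool layout from this thread
obj = 4 * 1024 * 1024          # hypothetical 4 MiB object

normal = read_traffic(obj, k, m)
fast = read_traffic(obj, k, m, fast_read=True)

print(normal)        # equals the object size: only data shards move
print(fast / normal) # (k + m) / k = 1.5, i.e. 50% more traffic
```

With k=4, m=2 this is a fixed 1.5x fan-out on every read, which matches the observation that fast_read loses on a network-bandwidth-limited cluster.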
> > > > > This answers an earlier question of mine.
> > > >
> > > > I don't quite understand what you're asking/saying here, but if an OSD gets marked out, all the PGs that used to rely on it will get another OSD unless you've instructed the cluster not to do so. The specifics of any given erasure code have nothing to do with it. :)
> > > > -Greg
> > >
> > > Ah, sorry, let me clarify.
> > > The EC pool I am considering is k=4 m=2 with failure domain host, on 6 hosts.
> > > So necessarily, there is one shard on each host. If one host goes down for a prolonged time, there's no "logical" advantage to redistributing things - since whatever you do, with 5 hosts, all PGs will stay in a degraded state anyway.
> > >
> > > However, I noticed Ceph is remapping all PGs, and actively moving data. I presume this is done for two reasons:
> > > - The remapping is needed since the primary OSD might be the one which went down. But for remapping (I guess) there's no need to actually move data, or is there?
> > > - The data movement is done to have the "k" shards available.
> > > If it's really the case that "all shards are equal", then data movement should not occur - or is this a bug / bad feature?
> >
> > If you lose one OSD out of a host, Ceph is going to try and re-replicate the data onto the other OSDs in that host. Your PG size and the CRUSH rule instruct it that the PG needs 6 different OSDs, and that those OSDs need to be placed on different hosts.
> >
> > You're right that this gets very funny if your PG size is equal to the number of hosts. We generally discourage people from running configurations like that.
>
> Yes. k=4 with m=2 on 6 hosts (i.e. the possibility to lose 2 hosts) would be our starting point - since we may add more hosts later (not too soon-ish, but it's not excluded that more may come in a year or so), and migrating large EC pools to different settings still seems a bit messy.
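To make the capacity and redundancy numbers in this sub-thread concrete, here is a small sketch (my arithmetic, not output from Ceph) of what k=4, m=2 with failure domain host implies when the host count exactly equals k+m:

```python
# Simple arithmetic for an EC pool with failure domain = host, assuming
# exactly one shard per host (forced when hosts == k + m, as in this thread).

def ec_profile(k, m, hosts):
    shards = k + m
    assert hosts >= shards, "need at least k+m failure domains"
    return {
        "usable_fraction": k / shards,  # share of raw space holding data
        "raw_overhead": shards / k,     # raw bytes stored per logical byte
        "hosts_tolerated": m,           # whole hosts that may fail at once
        "spare_hosts": hosts - shards,  # hosts left over to remap onto
    }

p = ec_profile(k=4, m=2, hosts=6)
print(p["raw_overhead"])     # 1.5 raw bytes per logical byte
print(p["hosts_tolerated"])  # 2 hosts may be down, data stays readable
print(p["spare_hosts"])      # 0: with a host down, PGs must stay degraded
```

The `spare_hosts = 0` case is exactly the configuration being discouraged above: there is no sixth failure domain to remap the missing shard onto, so remapping cannot restore full redundancy.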
> We can't really afford to reduce available storage significantly more in the current setup, and would like to have the possibility to lose one host (for example for an OS upgrade), and then still lose a few disks in case they fail with bad timing.
> >
> > Or if you mean that you are losing a host, and the data is shuffling around on the remaining hosts... hrm, that'd be weird. (Perhaps a result of EC pools' "indep" rather than "firstn" CRUSH rules?)
>
> They are indep, which I think is the default (no manual editing done). I thought the main goal of indep was exactly to reduce data movement.
> Indeed, it's very funny that data is moved; it certainly does not help to increase redundancy ;-).

Given that you're stuck in that state, you probably want to set mon_osd_down_out_subtree_limit so that it doesn't mark out a whole host. Can you also share the output of "ceph osd crush dump"?
-Greg
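For reference, one way to apply this suggestion (a sketch only; the exact syntax for setting mon options varies between Ceph releases, so check it against your version before using it):

```shell
# ceph.conf fragment (mon section), assumed syntax for this era of Ceph.
# "host" means subtrees of host size or larger are never auto-marked out,
# so losing a whole host does not trigger remapping:
#
#   [mon]
#   mon osd down out subtree limit = host

# Injecting it at runtime on the monitors (syntax may vary by release):
ceph tell mon.* injectargs '--mon-osd-down-out-subtree-limit=host'

# And the CRUSH map dump requested above:
ceph osd crush dump
```

Both commands must run against a live cluster with admin credentials; the injected value does not survive a monitor restart unless it is also placed in ceph.conf.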
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
