On Mon, Feb 26, 2018 at 11:33 AM Oliver Freyermuth <
[email protected]> wrote:

> Am 26.02.2018 um 20:23 schrieb Gregory Farnum:
> >
> >
> > On Mon, Feb 26, 2018 at 11:06 AM Oliver Freyermuth
> > <[email protected]> wrote:
> >
> >     Am 26.02.2018 um 19:45 schrieb Gregory Farnum:
> >     > On Mon, Feb 26, 2018 at 10:35 AM Oliver Freyermuth <
> [email protected] <mailto:[email protected]>
> <mailto:[email protected] <mailto:
> [email protected]>>> wrote:
> >     >
> >     >     Am 26.02.2018 um 19:24 schrieb Gregory Farnum:
> >     >     > I don't actually know this option, but based on your results
> >     >     > it's clear that "fast read" tells the OSD to issue reads to
> >     >     > all k+m OSDs storing data and then reconstruct the data from
> >     >     > the first k to respond. Without fast read, it simply asks the
> >     >     > regular k data OSDs and returns the reply directly. This is a
> >     >     > straight trade-off of more bandwidth for lower long-tail
> >     >     > latencies.
> >     >     > -Greg
> >     >
> >     >     Many thanks, this certainly explains it!
> >     >     Apparently I misunderstood how "normal" reads work - I thought
> >     >     that in any case all shards would be requested, and the primary
> >     >     OSD would verify the erasure coding is still intact.
> >     >
> >     >
> >     > Nope, EC PGs can self-validate (they checksum everything) and so
> extra shards are requested only if one of the OSDs has an error.
> >     >
> >     >
> >     >     However, with the explanation that indeed only the actual "k"
> shards are read in the "normal" case, it's fully clear to me that
> "fast_read" will be slower for us,
> >     >     since we are limited by network bandwidth.
> >     >
> >     >     On a side note, activating fast_read also appears to increase
> >     >     CPU load a bit, probably due to the EC reconstruction that has
> >     >     to be performed if the "wrong" shards arrive at the primary
> >     >     OSD first.
> >     >
> >     >     I believe this also explains why an EC pool actually does
> remapping in a k=4 m=2 pool with failure domain host if one of 6 hosts goes
> down:
> >     >     Namely, to have the "k" shards available on the "up" OSDs.
> This answers an earlier question of mine.
> >     >
> >     >
> >     > I don't quite understand what you're asking/saying here, but if an
> OSD gets marked out all the PGs that used to rely on it will get another
> OSD unless you've instructed the cluster not to do so. The specifics of any
> given erasure code have nothing to do with it. :)
> >     > -Greg
> >
> >     Ah, sorry, let me clarify.
> >     The EC pool I am considering is k=4 m=2 with failure domain host, on
> 6 hosts.
> >     So necessarily, there is one shard on each host. If one host goes
> >     down for a prolonged time, there's no "logical" advantage to
> >     redistributing things - whatever you do, with 5 hosts all PGs will
> >     stay in a degraded state anyway.
> >
> >     However, I noticed Ceph is remapping all PGs, and actively moving
> data. I presume now this is done for two reasons:
> >     - The remapping is needed since the primary OSD might be the one
> which went down. But for remapping (I guess) there's no need to actually
> move data,
> >       or is there?
> >     - The data movement is done to have the "k" shards available.
> >     If it's really the case that "all shards are equal", then data
> movement should not occur - or is this a bug / bad feature?
> >
> >
> > If you lose one OSD out of a host, Ceph is going to try to re-replicate
> > the data onto the other OSDs in that host. Your PG size and the CRUSH
> > rule instruct it that the PG needs 6 different OSDs, and those OSDs need
> > to be placed on different hosts.
> >
> > You're right that it gets very funny if your PG size is equal to the
> > number of hosts. We generally discourage people from running
> > configurations like that.
>
> Yes. k=4 with m=2 on 6 hosts (i.e. the possibility to lose 2 hosts) would
> be our starting point, since we may add more hosts later (not too soon,
> but it's not excluded that more may come in a year or so), and migrating
> large EC pools to different settings still seems a bit messy.
> We can't really afford to reduce available storage significantly more in
> the current setup, and we would like to be able to lose one host (for
> example for an OS upgrade) and then still lose a few disks in case they
> fail with bad timing.
>
> >
> > Or if you mean that you are losing a host, and the data is shuffling
> around on the remaining hosts...hrm, that'd be weird. (Perhaps a result of
> EC pools' "indep" rather than "firstn" crush rules?)
>
> They are indep, which I think is the default (no manual editing done). I
> thought the main goal of indep was exactly to reduce data movement.
> Indeed, it's very funny that data is moved; it certainly does not help to
> increase redundancy ;-).
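
For reference, an indep EC rule with failure domain host typically looks
roughly like the sketch below (the rule name and id are placeholders here;
the actual rules can be checked with "ceph osd crush rule dump"):

    rule ecpool-rule {
        id 1
        type erasure
        min_size 3
        max_size 6
        step set_chooseleaf_tries 5
        step set_choose_tries 100
        step take default
        step chooseleaf indep 0 type host
        step emit
    }

With "indep", a failed OSD is replaced in place in the shard mapping
instead of shifting all later shards (as "firstn" would), which is what is
supposed to keep data movement low for EC pools.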


Given that you're stuck in that state, you probably want to set
the mon_osd_down_out_subtree_limit so that it doesn't mark out a whole host.
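
A sketch of how that could look in ceph.conf (assuming "host" is the
subtree size you want to protect; with this set, the monitors will no
longer automatically mark out all OSDs of an entire failed host):

    [mon]
    mon_osd_down_out_subtree_limit = host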

Can you also share the output of "ceph osd crush dump"?
-Greg
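
(For reference, the fast_read behaviour discussed above is a per-pool
option; assuming a hypothetical pool named "ecpool", it can be toggled and
inspected like this:

    ceph osd pool set ecpool fast_read 1    # read all k+m shards, use first k
    ceph osd pool get ecpool fast_read      # show the current setting
    ceph osd pool set ecpool fast_read 0    # read only the k data shards

)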
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
