On 26.02.2018 at 19:45, Gregory Farnum wrote:
> On Mon, Feb 26, 2018 at 10:35 AM Oliver Freyermuth
> <[email protected]> wrote:
>> On 26.02.2018 at 19:24, Gregory Farnum wrote:
>>> I don't actually know this option, but based on your results it's clear
>>> that "fast read" is telling the OSD it should issue reads to all k+m OSDs
>>> storing data and then reconstruct the data from the first k to respond.
>>> Without the fast read it simply asks the regular k data nodes to read it
>>> back straight and sends the reply back. This is a straight trade-off of
>>> more bandwidth for lower long-tail latencies.
>>> -Greg
>>
>> Many thanks, this certainly explains it!
>> Apparently I misunderstood how "normal" read works - I thought that in
>> any case, all shards would be requested, and the primary OSD would check
>> the EC is still fine.
>
> Nope, EC PGs can self-validate (they checksum everything), and so extra
> shards are requested only if one of the OSDs has an error.
>
>> However, with the explanation that indeed only the actual "k" shards are
>> read in the "normal" case, it's fully clear to me that "fast_read" will
>> be slower for us, since we are limited by network bandwidth.
>>
>> On a side note, activating fast_read also appears to increase CPU load a
>> bit, which is then probably due to the EC calculations that need to be
>> performed if the "wrong" shards arrived at the primary OSD first.
>>
>> I believe this also explains why an EC pool actually does remapping in a
>> k=4 m=2 pool with failure domain host if one of 6 hosts goes down:
>> namely, to have the "k" shards available on the "up" OSDs. This answers
>> an earlier question of mine.
>
> I don't quite understand what you're asking/saying here, but if an OSD
> gets marked out, all the PGs that used to rely on it will get another OSD
> unless you've instructed the cluster not to do so. The specifics of any
> given erasure code have nothing to do with it. :)
> -Greg
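Greg's description above suggests a back-of-envelope model for the extra inter-OSD read traffic fast_read causes (a sketch of my own, not from the thread; it assumes the primary holds exactly one shard locally and fetches the rest over the network, ignoring caching and client traffic):

```python
# Back-of-envelope model: a normal EC read fetches the k data shards;
# fast_read fetches all k+m shards and reconstructs from the first k.
# The primary already holds one shard locally, so inter-OSD ("incoming")
# traffic covers only the remaining shards.

def remote_shards_fetched(k: int, m: int, fast_read: bool) -> int:
    """Shards the primary fetches over the network per object read."""
    shards = (k + m) if fast_read else k
    return shards - 1  # one shard is local to the primary

k, m = 4, 2
normal = remote_shards_fetched(k, m, fast_read=False)  # 3 remote shards
fast = remote_shards_fetched(k, m, fast_read=True)     # 5 remote shards
print(f"inter-OSD traffic ratio fast/normal: {fast / normal:.2f}")  # 1.67
```

Under this toy model, incoming traffic on the OSD hosts should rise by up to ~1.67x with fast_read on a k=4 m=2 pool, which is at least directionally consistent with the traffic numbers observed further down in the thread.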
Ah, sorry, let me clarify.
The EC pool I am considering is k=4 m=2 with failure domain host, on 6 hosts.
So necessarily, there is one shard per host. If one host goes down for a
prolonged time, there's no "logical" advantage to redistributing things -
since whatever you do, with 5 hosts, all PGs will stay in a degraded state
anyway.
However, I noticed Ceph is remapping all PGs and actively moving data. I
now presume this is done for two reasons:
- The remapping is needed since the primary OSD might be the one which went
down. But for remapping (I guess) there's no need to actually move data -
or is there?
- The data movement is done to have the "k" shards available.
If it's really the case that "all shards are equal", then data movement should
not occur - or is this a bug / bad feature?
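To illustrate the situation being described (my own sketch, not from the thread; the function and its simplifying assumptions are mine): with failure domain host, each PG needs its k+m shards on distinct hosts, so with exactly 6 hosts a single host failure leaves every PG undersized no matter how much data is moved:

```python
# Illustrative sketch (not Ceph code): with failure domain "host", a PG's
# k+m shards must land on k+m *distinct* hosts. If the cluster has exactly
# k+m hosts, losing one leaves too few hosts for a full shard set, so
# every PG stays degraded regardless of any remapping or data movement.

def placeable_shards(up_hosts: int, k: int, m: int) -> int:
    """How many of the k+m shards can be placed on distinct up hosts."""
    return min(up_hosts, k + m)

k, m = 4, 2
total_hosts = 6

healthy = placeable_shards(total_hosts, k, m)      # 6: full shard set
degraded = placeable_shards(total_hosts - 1, k, m)  # 5: one shard homeless
print(f"healthy: {healthy} shards, one host down: {degraded} shards")

# Still at least k shards available, so data remains readable - the PG is
# degraded but not lost, and no amount of data movement changes that.
assert degraded >= k
```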
Cheers,
Oliver
>
>
>
> Many thanks for clearing this up!
>
> Cheers,
> Oliver
>
> > On Mon, Feb 26, 2018 at 3:57 AM Oliver Freyermuth
> > <[email protected]> wrote:
> >
> > Some additional information gathered from our monitoring:
> > It seems fast_read does indeed become active immediately, but I do
> not understand the effect.
> >
> > With fast_read = 0, we see:
> > ~ 5.2 GB/s total outgoing traffic from all 6 OSD hosts
> > ~ 2.3 GB/s total incoming traffic to all 6 OSD hosts
> >
> > With fast_read = 1, we see:
> > ~ 5.1 GB/s total outgoing traffic from all 6 OSD hosts
> > ~ 3 GB/s total incoming traffic to all 6 OSD hosts
> >
> > I would have expected exactly the contrary to happen...
> >
> > Cheers,
> > Oliver
> >
> > On 26.02.2018 at 12:51, Oliver Freyermuth wrote:
> > > Dear Cephalopodians,
> > >
> > > in the few remaining days when we can still play at our will with
> parameters,
> > > we just now tried to set:
> > > ceph osd pool set cephfs_data fast_read 1
> > > but did not notice any effect on sequential, large file read
> throughput on our k=4 m=2 EC pool.
> > >
> > > Should this become active immediately? Or do OSDs need a restart
> first?
> > > Is the option already deemed safe?
> > >
> > > Or is it just that we should not expect any change on throughput,
> since our system (for large sequential reads)
> > > is purely limited by the IPoIB throughput, and the shards are
> nevertheless requested by the primary OSD?
> > > So the gain would not be in throughput, but the reply to the
> client would be slightly faster (before all shards have arrived)?
> > > Then this option would be mainly of interest if the disk IO was
> congested (which does not happen for us as of yet)
> > > and not help so much if the system is limited by network
> bandwidth.
> > >
> > > Cheers,
> > > Oliver
> > >
> > >
> > >
> > > _______________________________________________
> > > ceph-users mailing list
> > > [email protected]
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > >
> >
> >
> >
>
>
