On 26.02.2018 at 19:45, Gregory Farnum wrote:
> On Mon, Feb 26, 2018 at 10:35 AM Oliver Freyermuth
> <[email protected]> wrote:
>> On 26.02.2018 at 19:24, Gregory Farnum wrote:
>>> I don't actually know this option, but based on your results it's clear
>>> that "fast read" is telling the OSD it should issue reads to all k+m OSDs
>>> storing data and then reconstruct the data from the first k to respond.
>>> Without the fast read it simply asks the regular k data nodes to read it
>>> back straight and sends the reply back. This is a straight trade-off of
>>> more bandwidth for lower long-tail latencies.
>>> -Greg
>>
>> Many thanks, this certainly explains it!
>> Apparently I misunderstood how "normal" read works - I thought that in
>> any case, all shards would be requested, and the primary OSD would check
>> the EC is still fine.
>
> Nope, EC PGs can self-validate (they checksum everything), and so extra
> shards are requested only if one of the OSDs has an error.
>
>> However, with the explanation that indeed only the actual "k" shards are
>> read in the "normal" case, it's fully clear to me that "fast_read" will
>> be slower for us, since we are limited by network bandwidth.
>>
>> On a side note, activating fast_read also appears to increase CPU load a
>> bit, which is then probably due to the EC calculations that need to be
>> performed if the "wrong" shards arrived at the primary OSD first.
>>
>> I believe this also explains why an EC pool actually does remapping in a
>> k=4 m=2 pool with failure domain host if one of 6 hosts goes down:
>> namely, to have the "k" shards available on the "up" OSDs. This answers
>> an earlier question of mine.
>
> I don't quite understand what you're asking/saying here, but if an OSD
> gets marked out, all the PGs that used to rely on it will get another OSD
> unless you've instructed the cluster not to do so. The specifics of any
> given erasure code have nothing to do with it. :)
> -Greg
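Greg's description above suggests a back-of-envelope model for the extra inter-OSD read traffic fast_read causes (a sketch of my own, not from the thread; it assumes the primary holds exactly one shard locally and fetches the rest over the network, ignoring caching and client traffic):

```python
# Back-of-envelope model: a normal EC read fetches the k data shards;
# fast_read fetches all k+m shards and reconstructs from the first k.
# The primary already holds one shard locally, so inter-OSD ("incoming")
# traffic covers only the remaining shards.

def remote_shards_fetched(k: int, m: int, fast_read: bool) -> int:
    """Shards the primary fetches over the network per object read."""
    shards = (k + m) if fast_read else k
    return shards - 1  # one shard is local to the primary

k, m = 4, 2
normal = remote_shards_fetched(k, m, fast_read=False)  # 3 remote shards
fast = remote_shards_fetched(k, m, fast_read=True)     # 5 remote shards
print(f"inter-OSD traffic ratio fast/normal: {fast / normal:.2f}")  # 1.67
```

Under this toy model, incoming traffic on the OSD hosts should rise by up to ~1.67x with fast_read on a k=4 m=2 pool, which is at least directionally consistent with the traffic numbers observed further down in the thread.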
Ah, sorry, let me clarify.
The EC pool I am considering is k=4 m=2 with failure domain host, on 6 hosts.
So necessarily, there is one shard per host. If one host goes down for a
prolonged time, there's no "logical" advantage to redistributing things -
since whatever you do, with 5 hosts, all PGs will stay in a degraded state
anyway.
However, I noticed Ceph is remapping all PGs and actively moving data. I
now presume this is done for two reasons:
- The remapping is needed since the primary OSD might be the one which went
down. But for remapping (I guess) there's no need to actually move data -
or is there?
- The data movement is done to have the "k" shards available.
If it's really the case that "all shards are equal", then data movement should
not occur - or is this a bug / bad feature?
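To illustrate the situation being described (my own sketch, not from the thread; the function and its simplifying assumptions are mine): with failure domain host, each PG needs its k+m shards on distinct hosts, so with exactly 6 hosts a single host failure leaves every PG undersized no matter how much data is moved:

```python
# Illustrative sketch (not Ceph code): with failure domain "host", a PG's
# k+m shards must land on k+m *distinct* hosts. If the cluster has exactly
# k+m hosts, losing one leaves too few hosts for a full shard set, so
# every PG stays degraded regardless of any remapping or data movement.

def placeable_shards(up_hosts: int, k: int, m: int) -> int:
    """How many of the k+m shards can be placed on distinct up hosts."""
    return min(up_hosts, k + m)

k, m = 4, 2
total_hosts = 6

healthy = placeable_shards(total_hosts, k, m)      # 6: full shard set
degraded = placeable_shards(total_hosts - 1, k, m)  # 5: one shard homeless
print(f"healthy: {healthy} shards, one host down: {degraded} shards")

# Still at least k shards available, so data remains readable - the PG is
# degraded but not lost, and no amount of data movement changes that.
assert degraded >= k
```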
Cheers,
Oliver
>
>
>
> Many thanks for clearing this up!
>
> Cheers,
> Oliver
>
> > On Mon, Feb 26, 2018 at 3:57 AM Oliver Freyermuth
> > <[email protected]> wrote:
> >
> > Some additional information gathered from our monitoring:
> > It seems fast_read does indeed become active immediately, but I do
> not understand the effect.
> >
> > With fast_read = 0, we see:
> > ~ 5.2 GB/s total outgoing traffic from all 6 OSD hosts
> > ~ 2.3 GB/s total incoming traffic to all 6 OSD hosts
> >
> > With fast_read = 1, we see:
> > ~ 5.1 GB/s total outgoing traffic from all 6 OSD hosts
> > ~ 3 GB/s total incoming traffic to all 6 OSD hosts
> >
> > I would have expected exactly the contrary to happen...
> >
> > Cheers,
> > Oliver
> >
> > On 26.02.2018 at 12:51, Oliver Freyermuth wrote:
> > > Dear Cephalopodians,
> > >
> > > in the few remaining days when we can still play at our will with
> parameters,
> > > we just now tried to set:
> > > ceph osd pool set cephfs_data fast_read 1
> > > but did not notice any effect on sequential, large file read
> throughput on our k=4 m=2 EC pool.
> > >
> > > Should this become active immediately? Or do OSDs need a restart
> first?
> > > Is the option already deemed safe?
> > >
> > > Or is it just that we should not expect any change on throughput,
> since our system (for large sequential reads)
> > > is purely limited by the IPoIB throughput, and the shards are
> nevertheless requested by the primary OSD?
> > > So the gain would not be in throughput, but the reply to the
> client would be slightly faster (before all shards have arrived)?
> > > Then this option would be mainly of interest if the disk IO was
> congested (which does not happen for us as of yet)
> > > and not help so much if the system is limited by network
> bandwidth.
> > >
> > > Cheers,
> > > Oliver
> > >
> > >
> > >
> > > _______________________________________________
> > > ceph-users mailing list
> > > [email protected]
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > >
> >
> >
> >
>
>
