On Tue, Aug 25, 2015 at 10:40 AM, Nick Fisk <n...@fisk.me.uk> wrote:
> I have done two tests, one with 1MB objects and another with 4MB objects.
> My cluster is a little busier than when I did the quick test yesterday, so
> all speeds are slightly down across the board, but you can see the scaling
> effect nicely. Results:
>
> 1MB Order RBD
> Readahead       DD->Null Speed
> 128             18MB/s
> 1024            32MB/s
> 2048            40MB/s
> 4096            58MB/s
> 8192            75MB/s
> 16384           91MB/s
> 32768           160MB/s
>
> 4MB Order RBD
> Readahead       DD->Null Speed
> 128             42MB/s
> 1024            56MB/s
> 2048            61MB/s
> 4096            98MB/s
> 8192            121MB/s
> 16384           170MB/s
> 32768           195MB/s
> 65536           221MB/s
> 131072          271MB/s
>
> I think the results confirm my suspicions. In a RAID array a full stripe
> will usually only be a couple of MB (e.g. a 256KB chunk * 8 disks), so a
> relatively small readahead will involve all the disks for max performance.
> In a Ceph RBD a full stripe is effectively 4MB * the number of OSDs in the
> cluster. So I think that if sequential read performance is the only goal,
> readahead probably needs to equal that figure, which could be massive. But
> in reality, like me, you will probably find that you get sufficient
> performance at a lower value. Of course, all this theory could change when
> the kernel client gets striping support.
>
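A minimal sketch of the sizing arithmetic above, in Python; the 40-OSD
cluster used in the example is hypothetical, not Nick's actual number:

# Back-of-the-envelope readahead sizing for sequential reads on a kernel
# RBD device, following the reasoning above. Values are hypothetical.

def full_stripe_readahead_kb(object_size_mb, num_osds):
    """Readahead (in KB) needed to touch every OSD at once, i.e. one
    'full stripe' of object_size * number of OSDs."""
    return object_size_mb * 1024 * num_osds

# Example: default 4MB objects on a hypothetical 40-OSD cluster.
print(full_stripe_readahead_kb(4, 40))   # 163840 KB (160MB) - "massive"

# RAID comparison from the text: 256KB chunk * 8 disks.
print(256 * 8)                           # 2048 KB - a couple of MB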
> However, in terms of a default, that's a tricky one. Even setting it to
> 4096 would probably start to have a negative impact on pure random IO
> latency, as each read would make an OSD read a whole 4MB object; see the
> small table below for IOPs vs. read size for the disks in my cluster. I
> would imagine somewhere between 256 and 1024 would be a good trade-off, as
> that is about where the OSD disks' latency starts to rise. Users would
> need to be aware of their workload and tweak readahead if needed.
>
> Read size (random)      IOPs
> 4k                      83
> 64k                     81
> 256k                    73
> 1M                      52
> 4M                      25
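As a rough reading of that table (no new measurements), the reciprocal of
the IOPs figure gives the approximate per-read service time, which shows
why turning every small random read into a full 4MB object read hurts
latency. A minimal Python sketch:

# Approximate per-read service time derived from the IOPs figures above.
# This is simply 1/IOPs and ignores queuing and concurrency.

iops_by_read_size = {"4k": 83, "64k": 81, "256k": 73, "1M": 52, "4M": 25}

for size, iops in iops_by_read_size.items():
    print("%4s random read: ~%.0f ms per IO" % (size, 1000.0 / iops))

# 4k -> ~12 ms, 4M -> ~40 ms: on these disks a 4MB read takes roughly
# three times as long per IO as a 4k read.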

Yeah, we want a sensible default, but it's always going to be a trade-off.
librbd has readahead knobs, but the only real use case there is shortening
qemu boot times, so we can't copy those settings.  I'll have to think about
it some more - it might make more sense to leave things as they are.  Users
with large sequential read workloads should know to check and adjust
readahead settings, and likely won't be satisfied with 1x object size
anyway.
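
For reference, the check-and-adjust itself is just the standard block-layer
readahead knob; a minimal sketch, assuming a kernel RBD device at /dev/rbd0
and an arbitrary 4096KB target (both hypothetical):

# Check and adjust readahead for a kernel RBD device via sysfs.
# /dev/rbd0 and the 4096 KB target are hypothetical examples.

RA_PATH = "/sys/block/rbd0/queue/read_ahead_kb"   # value is in KB

with open(RA_PATH) as f:
    print("current readahead: %s KB" % f.read().strip())

# Needs root; equivalent to `blockdev --setra 8192 /dev/rbd0`
# (blockdev counts 512-byte sectors, so 8192 sectors == 4096 KB).
with open(RA_PATH, "w") as f:
    f.write("4096")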

Thanks,

                Ilya