On Tue, Aug 25, 2015 at 5:05 PM, Nick Fisk <[email protected]> wrote:
>> -----Original Message-----
>> From: Ilya Dryomov [mailto:[email protected]]
>> Sent: 25 August 2015 09:45
>> To: Nick Fisk <[email protected]>
>> Cc: Ceph Development <[email protected]>
>> Subject: Re: Kernel RBD Readahead
>>
>> On Tue, Aug 25, 2015 at 10:40 AM, Nick Fisk <[email protected]> wrote:
>> > I have done two tests, one with 1MB objects and another with 4MB objects.
>> > My cluster is a little busier than when I did the quick test yesterday, so
>> > all speeds are slightly down across the board, but you can see the scaling
>> > effect nicely. Results:
>> >
>> > 1MB Order RBD
>> > Readahead (KB)   dd -> /dev/null speed
>> > 128 18MB/s
>> > 1024 32MB/s
>> > 2048 40MB/s
>> > 4096 58MB/s
>> > 8192 75MB/s
>> > 16384 91MB/s
>> > 32768 160MB/s
>> >
>> > 4MB Order RBD
>> > Readahead (KB)   dd -> /dev/null speed
>> > 128 42MB/s
>> > 1024 56MB/s
>> > 2048 61MB/s
>> > 4096 98MB/s
>> > 8192 121MB/s
>> > 16384 170MB/s
>> > 32768 195MB/s
>> > 65536 221MB/s
>> > 131072 271MB/s
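
A minimal sketch of how a sweep like the one above can be scripted, assuming
the readahead figures are read_ahead_kb values and the device is /dev/rbd0
(both are assumptions; it needs root and adjust to taste):

    #!/usr/bin/env python3
    # Sweep read_ahead_kb and time a sequential read of the block device.
    # DEV, the readahead values and the 1 GiB read size are assumptions.
    import time

    DEV = "rbd0"
    RA_KB = [128, 1024, 2048, 4096, 8192, 16384, 32768]
    TOTAL = 1 << 30              # read 1 GiB per run
    BLOCK = 1 << 20              # 1 MiB reads, roughly "dd bs=1M"

    for ra in RA_KB:
        with open(f"/sys/block/{DEV}/queue/read_ahead_kb", "w") as f:
            f.write(str(ra))
        with open("/proc/sys/vm/drop_caches", "w") as f:
            f.write("3")         # drop the page cache so readahead matters
        start = time.time()
        left = TOTAL
        with open(f"/dev/{DEV}", "rb", buffering=0) as dev:
            while left > 0:
                chunk = dev.read(min(BLOCK, left))
                if not chunk:
                    break
                left -= len(chunk)
        mb_s = (TOTAL - left) / (1 << 20) / (time.time() - start)
        print(f"read_ahead_kb={ra:<6} {mb_s:.0f} MB/s")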
>> >
>> > I think the results confirm my suspicions: a full stripe in a RAID array
>> > will usually only be a couple of MB (e.g. a 256KB chunk * 8 disks), so a
>> > relatively small readahead will involve all the disks for maximum
>> > performance. In a Ceph RBD a full stripe will be 4MB * the number of OSDs
>> > in the cluster, so I think that if sequential read performance is the only
>> > goal, readahead probably needs to equal that figure, which could be
>> > massive. But in reality you will probably find, like me, that you get
>> > sufficient performance at a lower value. Of course, all of this could
>> > change when the kernel client gets striping support.
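
As a quick back-of-the-envelope comparison of those two "full stripe"
figures (the OSD count is an assumed example):

    # RAID full stripe vs Ceph RBD "full stripe", in KB.
    raid_stripe_kb = 256 * 8          # 256KB chunk * 8 disks = 2048 KB (~2 MB)
    object_kb = 4 * 1024              # default 4MB RBD objects
    osds = 40                         # assumed cluster size
    rbd_stripe_kb = object_kb * osds  # 163840 KB, i.e. 160 MB of readahead
    print(raid_stripe_kb, rbd_stripe_kb)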
>> >
>> > However, in terms of a default, that's a tricky one. Even setting it to
>> > 4096 would probably start to have a negative impact on pure random IO
>> > latency, since each read would make an OSD read a whole 4MB object; see
>> > the small table below of IOPs per read size for the disks in my cluster.
>> > I would imagine somewhere between 256 and 1024 would be a good trade-off,
>> > below the point where the OSD disks' latency starts to rise. Users would
>> > need to be aware of their workload and tweak readahead if needed.
>> >
>> > Read size (random)   IOPs
>> > 4k                     83
>> > 64k                    81
>> > 256k                   73
>> > 1M                     52
>> > 4M                     25
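
Multiplying those IOPs figures out into effective per-disk throughput makes
the trade-off a bit more obvious:

    # Effective per-disk throughput implied by the random-read IOPs above.
    iops = {"4k": 83, "64k": 81, "256k": 73, "1M": 52, "4M": 25}
    size_kb = {"4k": 4, "64k": 64, "256k": 256, "1M": 1024, "4M": 4096}
    for size, n in iops.items():
        print(f"{size:>4}: {n * size_kb[size] / 1024:.1f} MB/s")
    # -> roughly 0.3, 5.1, 18.2, 52.0 and 100.0 MB/s respectively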
>>
>> Yeah, we want a sensible default, but it's always going to be a trade-off.
>> librbd has readahead knobs, but the only real use case there is shortening
>> qemu boot times, so we can't copy those settings. I'll have to think about
>> it some more - it might make more sense to leave things as is. Users with
>> large sequential read workloads should know to check and adjust readahead
>> settings, and likely won't be satisfied with 1x object size anyway.
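
For reference, checking the current setting is just a sysfs read; something
along these lines works (device naming assumed):

    # Print the current readahead setting for each mapped rbd device.
    import glob
    for path in glob.glob("/sys/block/rbd*/queue/read_ahead_kb"):
        with open(path) as f:
            print(path, "=", f.read().strip(), "KB")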
>
> OK. I might try to create a 4.1 kernel with the blk-mq queue depth/IO size,
> readahead and max_segments fixes in, as I think the TCP_NODELAY bug will
> still be present in my old 3.14 kernel.

I can build 4.2-rc8 + readahead patch on gitbuilders for you.
Thanks,
Ilya