On Mon, Aug 24, 2015 at 5:43 PM, Ilya Dryomov <[email protected]> wrote:
> On Sun, Aug 23, 2015 at 10:23 PM, Nick Fisk <[email protected]> wrote:
>>> -----Original Message-----
>>> From: Ilya Dryomov [mailto:[email protected]]
>>> Sent: 23 August 2015 18:33
>>> To: Nick Fisk <[email protected]>
>>> Cc: Ceph Development <[email protected]>
>>> Subject: Re: Kernel RBD Readahead
>>>
>>> On Sat, Aug 22, 2015 at 11:45 PM, Nick Fisk <[email protected]> wrote:
>>> > Hi Ilya,
>>> >
>>> > I was wondering if I could just get your thoughts on a matter I have
>>> > run into?
>>> >
>>> > It's about the read performance of the RBD kernel client and blk-mq,
>>> > mainly when doing large single-threaded reads. During testing,
>>> > performance seems to be limited to around 40MB/s, which is probably
>>> > fairly similar to what you would expect from a single OSD. That is to
>>> > be expected, as an RBD is just a long chain of objects, each on a
>>> > different OSD, being read through in order one at a time.
>>> >
>>> > In theory readahead should make up for this by making the RBD client
>>> > read from several OSDs ahead of the currently required block. However,
>>> > from what I can see, setting a readahead value higher than
>>> > max_sectors_kb doesn't appear to have any effect, meaning that
>>> > readahead is limited to the object that is currently being read.
>>> > Would you be able to confirm whether this is correct and by design?
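>>> >
>>> > For reference, these are the knobs I mean (device name and values are
>>> > just illustrative):
>>> >
>>> >     $ blockdev --getra /dev/rbd0                 # readahead, in 512-byte sectors
>>> >     $ cat /sys/block/rbd0/queue/read_ahead_kb    # the same setting, in KB
>>> >     $ cat /sys/block/rbd0/queue/max_sectors_kb   # max request size, in KB
>>> >     $ blockdev --setra 65536 /dev/rbd0           # 32M readahead, well above max_sectors_kb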
>>>
>>> [CCing ceph-devel]
>>>
>>> Certainly not by design.  rbd is just a block device driver, so if the
>>> kernel submits a readahead read, it will obey and carry it out in full.
>>> The readahead is driven by the VM in pages; it doesn't care about rbd
>>> object boundaries and such.
>>>
>>> That said, one problem is in the VM subsystem, where readaheads get
>>> capped at 512 pages (= 2M).  If you do a simple single-threaded read test,
>>> you'll see 4096 sector (= 2M) I/Os instead of object-size I/Os:
>>>
>>>     $ rbd info foo | grep order
>>>             order 24 (16384 kB objects)
>>>     $ blockdev --getra /dev/rbd0
>>>     32768
>>>     $ dd if=/dev/rbd0 of=/dev/null bs=32M
>>>     # avgrq-sz is 4096.00
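>>>
>>> For reference, you can watch the request sizes while the dd runs with
>>> iostat from sysstat (rbd0 assumed); avgrq-sz is in 512-byte sectors:
>>>
>>>     $ iostat -dx rbd0 1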
>>>
>>> This was introduced in commit 6d2be915e589 ("mm/readahead.c: fix
>>> readahead failure for memoryless NUMA nodes and limit readahead pages")
>>> [1], which went into 3.15.  The hard limit was Linus' suggestion, 
>>> apparently.
>>>
>>> #define MAX_READAHEAD   ((512*4096)/PAGE_CACHE_SIZE)
>>> /*
>>>  * Given a desired number of PAGE_CACHE_SIZE readahead pages, return a
>>>  * sensible upper limit.
>>>  */
>>> unsigned long max_sane_readahead(unsigned long nr)
>>> {
>>>         return min(nr, MAX_READAHEAD);
>>> }
>>>
>>> This limit used to be dynamic and depended on the number of free pages in
>>> the system.  There has been an attempt to bring that behaviour back [2],
>>> but it hasn't gotten very far towards mainline.  It looks like Red Hat and
>>> Oracle are shipping [2] in some of their kernels though.  If you apply it,
>>> you'll see 32768 sector (= 16M) I/Os in the above test, which is how it
>>> should be.
>>>
>>> [1] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=6d2be915e589b58cb11418cbe1f22ff90732b6ac
>>> [2] http://thread.gmane.org/gmane.linux.kernel/1893680
>>>
>>> One thing we should be doing is setting read_ahead_kb to the object size;
>>> the default 128k doesn't really cut it for rbd.  I'll send a patch for that.
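>>>
>>> In the meantime it can be bumped by hand; a quick sketch, assuming
>>> /dev/rbd0 and the 16M object size from the example above (needs root):
>>>
>>>     # echo 16384 > /sys/block/rbd0/queue/read_ahead_kb
>>>
>>> read_ahead_kb is in kilobytes, so 16384 corresponds to 16M.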
>>>
>>> Thanks,
>>>
>>>                 Ilya
>>
>>
>> Thanks for your response.
>>
>> I do see the I/Os being limited to 4096 sectors on the 4.1 kernel, so that 
>> is likely to be part of the cause of the poor performance I am seeing. 
>> However, I tried a 3.14 kernel and saw the same level of performance, but 
>> this time the I/Os were limited to 1024 sectors. The queue depth was at 
>> around 8, so I guess this means it's submitting 8 x 512KB I/Os up to the 
>> max_sectors_kb limit of 4096KB. From the OSD point of view, this will 
>> still be accessing one OSD at a time.
>>
>> Maybe my expectations are wrong, but I was expecting one of these two 
>> scenarios to happen.
>>
>> 1. The kernel submits a single I/O large enough to satisfy the readahead 
>> value; max_sectors_kb would need to be higher than the object size 
>> (currently not possible) and the RADOS layer would be responsible for 
>> doing the parallel reads to the OSDs to satisfy it.
>>
>> 2. The kernel recognises that the readahead is bigger than the 
>> max_sectors_kb value and submits several I/Os in parallel to the RBD 
>> device to satisfy the readahead request, i.e. a 32MB readahead would 
>> submit 8 x 4MB I/Os in parallel.
>>
>> Please let me know if I have got the wrong idea here, but in my head 
>> either solution should improve sequential reads by a large amount, with 
>> the second possibly slightly better, as you are only waiting on the first 
>> OSD to respond to complete a request.
>>
>> Thanks for including the ceph-devel list. Unfortunately, despite several 
>> attempts, I have not been able to post to it after subscribing, so please 
>> forward any correspondence you think would be useful to share.
>
> Did you remember to set max_sectors_kb to max_hw_sectors_kb?  The block
> layer in 3.14 leaves max_sectors_kb at 512, even when max_hw_sectors_kb
> is set to a much bigger value by the driver.  If you adjust it, you
> should be able to see object-size requests, at least sometimes.  Note
> that you definitely won't see them all the time due to the max_segments
> limitation, which was lifted only recently.
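>
> A quick way to do that (rbd0 assumed, needs root):
>
>     # echo $(cat /sys/block/rbd0/queue/max_hw_sectors_kb) > /sys/block/rbd0/queue/max_sectors_kb
>
> i.e. just copy the driver's limit into the effective one.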

I just realized that what I wrote is true for O_DIRECT reads.  For page
cache driven reads, which is what we are discussing, the max_segments
limitation is the killer: 128 segments = 128 pages = 512k.  The fix was
a one-liner, but I don't think it was submitted for stable.
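
The limit is visible in sysfs, e.g. (rbd0 assumed):

    $ cat /sys/block/rbd0/queue/max_segments
    128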

The other thing is that 3.14.21+ kernels are just as screwed
readahead-wise as 3.15+, as the offending commit was backported.  So even
if I submit the max_segments one-liner to stable and it makes it into,
say, 3.14.52, we will still get only 4096-sector page cache I/Os, just
like you got on 4.1.

Thanks,

                Ilya
