On Fri, Oct 3, 2014 at 1:39 AM, Somnath Roy <[email protected]> wrote:
> Please share your opinion on this..
>
> -----Original Message-----
> From: Sage Weil [mailto:[email protected]]
> Sent: Wednesday, October 01, 2014 3:57 PM
> To: Somnath Roy
> Cc: Mark Nelson; Kasper Dieter; Andreas Bluemle; Paul Von-Stamwitz
> Subject: RE: Weekly Ceph Performance Meeting Invitation
>
> On Wed, 1 Oct 2014, Somnath Roy wrote:
>> Yes Sage, it's all read..Each call to lfn_open() will incur this
>> lookup in case of FDCache miss (which will be in 99% of cases).
>> The following patch will certainly help the write path (which is
>> exciting!)  but not read as read is not through the transaction path.
>> My understanding is in the read path per io only two calls are going
>> to filestore , one xattr ("_") and followed by read to the same
>> object. If somehow, we can club (or something) this two requests,
>> reads will be benefitted. I did some prototype earlier by passing the
>> fd (and path) to the replicated pg during getattr call and pass the
>> same fd/path during next read. This improving performance as well as
>> cpu usage. But, this is against the objectstore interface logic :-(
>> Basically, sole purpose of FDCache for serving this kind of scenario
>> but since it is sharded based on object hash now (and FDCache itself
>> is cpu
>> intensive) it is not helping much. May be sharding based on PG
>> (Col_id) could help here ?
>
> I suspect a more fruitful approach would be to make a read-side handle-based 
> API for objectstore... so you can 'open' an object, keep that handle to the 
> ObjectContext, and then do subsequent read operations against that.
>
> Sharding the FDCache per PG would help with lock contention, yes, but is that 
> the limiter or are we burning CPU?
>
>> Also, I don't think ceph io path is very memory intensive and we can
>> leverage some memory for cache usage. For example, if we can have a
>> object_context cache at Replicated PG level (now the cache is there
>> but the contexts are not persisted), the performance (and cpu
>> usage)will be improved dramatically. I know that there can be lot of
>> PGs and thus memory usage can be a challenge. But, certainly we can
>> control that by limiting per cache size and what not. What could be
>> the size of an object_context instance, shouldn't be much I guess. I
>> did some prototyping on that too and got significant improvement. This
>> will eliminate the getattr path in case of cache hit.
>
> Can you propose this on ceph-devel?  I think this is promising.  And probably 
> quite easy to implement.

Yes, we have done this impl and it can help reduce nearly 100us for
each IO if hit cache. We will make a pull request next week. :-)

>
>> Another challenge for read ( and for write too probably)  is the
>> sequential io in case of rbd . With the Linux default read_ahead ,
>> performance of sequential read is significantly less than random read
>> with latest code in case of io_size say 64K. The obvious reason is
>> that with rbd, the default object size being 4MB, lot of sequential
>> 64K reads are coming to same PG and getting bottlenecked there.
>> Increasing read_ahead size improving performance but that will have an
>> effect in random workload. I think PG level cache should help here.
>> Striped images from librbd will not be facing this problem I guess but
>> krbd is not supporting striping and it is definitely a problem there.
>
> I still think the key here is a comprehensive set of IO hints.  Then it's a 
> problem of making sure we are using them effectively...
>
>> We can discuss these in next meeting if this sounds interesting.
>
> Yeah, but let's discuss on list first, no reason to wait!
>
> s
>
>>
>> Thanks & Regards
>> Somnath
>>
>> -----Original Message-----
>> From: Sage Weil [mailto:[email protected]]
>> Sent: Wednesday, October 01, 2014 1:14 PM
>> To: Somnath Roy
>> Cc: Mark Nelson; Kasper Dieter; Andreas Bluemle; Paul Von-Stamwitz
>> Subject: RE: Weekly Ceph Performance Meeting Invitation
>>
>> On Wed, 1 Oct 2014, Somnath Roy wrote:
>> > CPU wise the following are still hurting us in Giant. Lot of fixes
>> > like IndexManager stuff went in Giant that helped cpu consumption
>> > wise as well.
>> >
>> > 1. LFNIndex lookup logic . I have a fix that will save around one
>> > cpu core on that path. I am yet to address comments made by Greg/Sam
>> > on that. But, lot of improvement can happen here.
>>
>> Have you looked at
>>
>>
>> https://github.com/ceph/ceph/commit/74b1cf8bf1a7a160e6ce14603df63a46b2
>> 2d8b98
>>
>> The patch is incomplete, but with that change we should be able to drop to a 
>> single path lookup per ObjectStore::Transaction (as opposed to one for each 
>> op in the transaction that touches the given object).  I'm not sure if you 
>> were looking at ops that had a lot of those or they were simple single-io 
>> type operations?  That would only help on the write path; I think you said 
>> you've been focusing in reads.
>>
>> > 2. Buffer class is very cpu intensive. Fixing that part will be
>> > helping every ceph components.
>>
>> +1
>>
>> sage
>>
>> ________________________________
>>
>> PLEASE NOTE: The information contained in this electronic mail message is 
>> intended only for the use of the designated recipient(s) named above. If the 
>> reader of this message is not the intended recipient, you are hereby 
>> notified that you have received this message in error and that any review, 
>> dissemination, distribution, or copying of this message is strictly 
>> prohibited. If you have received this communication in error, please notify 
>> the sender by telephone or e-mail (as shown above) immediately and destroy 
>> any and all copies of this message in your possession (whether hard copies 
>> or electronically stored copies).
>>
>>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to [email protected]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Best Regards,

Wheat
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Reply via email to