Hi Sushma,

Thanks for your investigation! We had already noticed the serialization
risk in GenericObjectMap/DBObjectMap; to improve performance we added a
header cache to DBObjectMap.

As for KeyValueStore, a cache branch is under review which greatly
reduces the number of lookup_header calls. Replacing the mutex with a
RWLock is a good suggestion as well; I would like to try it and measure
the effect.
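
Roughly, the RWLock change would look like the following (a simplified
sketch against the current code shape, using the common/RWLock.h
wrapper; _lookup_header() below just stands in for the existing
unlocked lookup path, this is not a tested patch):

    RWLock header_lock("GenericObjectMap::header_lock");

    GenericObjectMap::Header GenericObjectMap::lookup_header(
        const coll_t &cid, const ghobject_t &oid)
    {
      RWLock::RLocker l(header_lock);  // read-only lookups run concurrently
      return _lookup_header(cid, oid); // stand-in for the existing lookup code
    }

    // Paths that create, rename or remove headers would take
    // RWLock::WLocker l(header_lock) instead, keeping writers exclusive.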

On Tue, Jul 1, 2014 at 8:39 AM, Sushma Gurram <[email protected]> wrote:
> Hi Haomai/Greg,
>
> I tried to analyze this a bit more and it appears that the 
> GenericObjectMap::header_lock is serializing the READ requests in the 
> following path and hence the low performance numbers with KeyValueStore.
> ReplicatedPG::do_op() -> ReplicatedPG::find_object_context() -> 
> ReplicatedPG::get_object_context() -> PGBackend::objects_get_attr() -> 
> KeyValueStore::getattr() -> GenericObjectMap::get_values() -> 
> GenericObjectMap::lookup_header()
>
> I modified the code to bypass this lock for a test run and observed
> that the performance is then similar to FileStore.
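>
> For reference, the locking pattern in question looks roughly like this
> (a simplified sketch rather than the exact source; _lookup_header()
> stands in for the actual lookup code):
>
>     GenericObjectMap::Header GenericObjectMap::lookup_header(
>         const coll_t &cid, const ghobject_t &oid)
>     {
>       Mutex::Locker l(header_lock);    // exclusive even for read-only lookups
>       return _lookup_header(cid, oid); // LevelDB lookup runs under the lock
>     }
>
> so every concurrent getattr queues on header_lock before it can reach
> the key-value backend.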
>
> In our earlier investigations we also noticed similar serialization
> issues with DBObjectMap::header_lock when rgw stores xattrs in LevelDB.
>
> Can you please help us understand the reason for this lock, and whether
> it can be replaced with a RWLock? Any other suggestions for avoiding the
> serialization caused by this lock would also be welcome.
>
> Thanks,
> Sushma
>
> -----Original Message-----
> From: Haomai Wang [mailto:[email protected]]
> Sent: Friday, June 27, 2014 1:08 AM
> To: Sushma Gurram
> Cc: Shu, Xinxin; Mark Nelson; Sage Weil; Zhang, Jian; 
> [email protected]
> Subject: Re: [RFC] add rocksdb support
>
> As I mentioned a few days ago, there are two points related to
> KeyValueStore performance:
>
> 1. The rbd image order and the strip size are important. Because a
> header (which plays a role similar to an inode in a filesystem) is much
> more lightweight than an fd, the image order is expected to be lower,
> and the strip size can be configured to 4 KB to improve large-IO
> performance (example settings below).
> 2. The header cache (https://github.com/ceph/ceph/pull/1649) is not
> merged yet, and it is important for performance. It plays the same role
> as the FDCache in FileStore.
>
> As for the detailed performance numbers, I think the results based on
> the master branch are roughly correct. Once the strip-size tuning and
> the header cache are in place, they should improve.
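>
> For example (the option name is from memory, so please double-check it
> against config_opts.h; the image name and size are placeholders), the
> strip size could be set in ceph.conf as
>
>     [osd]
>         keyvaluestore default strip size = 4096
>
> and a lower-order test image created with
>
>     rbd create testimg --size 2048 --order 20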
>
> On Fri, Jun 27, 2014 at 8:44 AM, Sushma Gurram <[email protected]> 
> wrote:
>> Delivery failure due to table format. Resending as plain text.
>>
>> _____________________________________________
>> From: Sushma Gurram
>> Sent: Thursday, June 26, 2014 5:35 PM
>> To: 'Shu, Xinxin'; 'Mark Nelson'; 'Sage Weil'
>> Cc: 'Zhang, Jian'; [email protected]
>> Subject: RE: [RFC] add rocksdb support
>>
>>
>> Hi Xinxin,
>>
>> Thanks for providing the results of the performance tests.
>>
>> I used fio (with support for the rbd ioengine) to compare XFS and
>> RocksDB with a single OSD. I also confirmed with rados bench, and both
>> sets of numbers are of the same order.
>> My findings show that XFS is better than rocksdb. Can you please let us
>> know the rocksdb configuration that you used, the object size, and the
>> duration of the run for rados bench?
>> For the random write tests, I see the "rocksdb:bg0" thread as the top
>> CPU consumer (about 50% CPU, while every other thread in the OSD is
>> below 10%).
>> Is there a ceph.conf option to configure the number of background
>> threads in rocksdb?
>>
>> We ran our tests with the following configuration:
>> System : Intel(R) Xeon(R) CPU E5-4620 0 @ 2.20GHz (16 physical cores),
>> HT disabled, 16 GB memory
>>
>> The rocksdb configuration has been set to the following values in ceph.conf:
>>         rocksdb_write_buffer_size = 4194304
>>         rocksdb_cache_size = 4194304
>>         rocksdb_bloom_size = 0
>>         rocksdb_max_open_files = 10240
>>         rocksdb_compression = false
>>         rocksdb_paranoid = false
>>         rocksdb_log = /dev/null
>>         rocksdb_compact_on_mount = false
>>
>> We used the fio rbd ioengine with numjobs=1 for writes, numjobs=16 for
>> reads, and iodepth=32. Unlike rados bench, fio's rbd engine creates
>> multiple (=numjobs) client connections to the OSD, thus stressing the
>> OSD harder.
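>>
>> For reference, a job file along these lines reproduces this setup (the
>> pool and image names are placeholders; for reads, set rw=randread and
>> numjobs=16):
>>
>>     [global]
>>     ioengine=rbd
>>     clientname=admin
>>     pool=rbd
>>     rbdname=testimg
>>     iodepth=32
>>
>>     [randwrite-4k]
>>     rw=randwrite
>>     bs=4k
>>     numjobs=1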
>>
>> rbd image size = 2 GB, rocksdb_write_buffer_size=4MB
>> -------------------------------------------------------------------
>> IO Pattern      XFS (IOPs)      Rocksdb (IOPs)
>> 4K writes       ~1450           ~670
>> 4K reads        ~65000          ~2000
>> 64K writes      ~431            ~57
>> 64K reads       ~17500          ~180
>>
>>
>> rbd image size = 2 GB, rocksdb_write_buffer_size=1GB
>> -------------------------------------------------------------------
>> IO Pattern      XFS (IOPs)      Rocksdb (IOPs)
>> 4K writes       ~1450           ~962
>> 4K reads        ~65000          ~1641
>> 64K writes      ~431            ~426
>> 64K reads       ~17500          ~209
>>
>> I guess the lower rocksdb performance can in theory be attributed to
>> compaction during writes and merging during reads, but I'm not sure
>> READs should be lower by this magnitude. However, your results seem to
>> show otherwise. Can you please share your rocksdb config and describe
>> how rados bench was run?
>>
>> Thanks,
>> Sushma
>>
>> -----Original Message-----
>> From: Shu, Xinxin [mailto:[email protected]]
>> Sent: Sunday, June 22, 2014 6:18 PM
>> To: Sushma Gurram; 'Mark Nelson'; 'Sage Weil'
>> Cc: '[email protected]'; Zhang, Jian
>> Subject: RE: [RFC] add rocksdb support
>>
>>
>> Hi all,
>>
>> We enabled rocksdb as the data store in our test setup (10 OSDs on two
>> servers; each server has 5 HDDs as OSDs, 2 SSDs as journals, and an
>> Intel(R) Xeon(R) CPU E31280) and ran performance tests for xfs, leveldb
>> and rocksdb, using rados bench as our test tool; the chart below shows
>> the details. For writes, with a small number of threads, leveldb
>> performance is lower than the other two backends; from the 16-thread
>> point on, rocksdb performs a little better than xfs and leveldb; and
>> both leveldb and rocksdb perform much better than xfs at higher thread
>> counts.
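>>
>> For reference, the rados bench invocations for a given thread count
>> look roughly like this (the pool name and runtime here are
>> illustrative):
>>
>>     rados bench -p <pool> 300 write -t 16 --no-cleanup
>>     rados bench -p <pool> 300 seq -t 16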
>>
>> (throughput in MB/s, latency in seconds)
>>
>>                      xfs                 leveldb             rocksdb
>>                      throughput latency  throughput latency  throughput latency
>> 1 thread write        84.029    0.048     52.430    0.076     71.920    0.056
>> 2 threads write      166.417    0.048     97.917    0.082    155.148    0.052
>> 4 threads write      304.099    0.052    156.094    0.102    270.461    0.059
>> 8 threads write      323.047    0.099    221.370    0.144    339.455    0.094
>> 16 threads write     295.040    0.216    272.032    0.235    348.849    0.183
>> 32 threads write     324.467    0.394    290.072    0.441    338.103    0.378
>> 64 threads write     313.713    0.812    293.261    0.871    324.603    0.787
>> 1 thread read         75.687    0.053     71.629    0.056     72.526    0.055
>> 2 threads read       182.329    0.044    151.683    0.053    153.125    0.052
>> 4 threads read       320.785    0.050    307.180    0.052    312.016    0.051
>> 8 threads read       504.880    0.063    512.295    0.062    519.683    0.062
>> 16 threads read      477.706    0.134    643.385    0.099    654.149    0.098
>> 32 threads read      517.670    0.247    666.696    0.192    678.480    0.189
>> 64 threads read      516.599    0.495    668.360    0.383    680.673    0.376
>>
>> -----Original Message-----
>> From: Shu, Xinxin
>> Sent: Saturday, June 14, 2014 11:50 AM
>> To: Sushma Gurram; Mark Nelson; Sage Weil
>> Cc: [email protected]; Zhang, Jian
>> Subject: RE: [RFC] add rocksdb support
>>
>> Currently ceph gets its stable rocksdb from the 3.0.fb branch of
>> ceph/rocksdb. Since PR https://github.com/ceph/rocksdb/pull/2 has not
>> been merged yet, the tree you get with 'git submodule update --init'
>> does not support autoconf/automake.
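>>
>> Until that pull request is merged, one possible way to get a buildable
>> tree is to fetch the pull-request ref into the submodule by hand, for
>> example:
>>
>>     git submodule update --init src/rocksdb
>>     cd src/rocksdb
>>     git fetch https://github.com/ceph/rocksdb.git pull/2/head
>>     git checkout FETCH_HEAD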
>>
>> -----Original Message-----
>> From: [email protected]
>> [mailto:[email protected]] On Behalf Of Sushma Gurram
>> Sent: Saturday, June 14, 2014 2:52 AM
>> To: Shu, Xinxin; Mark Nelson; Sage Weil
>> Cc: [email protected]; Zhang, Jian
>> Subject: RE: [RFC] add rocksdb support
>>
>> Hi Xinxin,
>>
>> I tried to compile the wip-rocksdb branch, but the src/rocksdb
>> directory seems to be empty. Do I need to put autoconf/automake files
>> in this directory?
>> It doesn't seem to have any other source files, and compilation fails:
>> os/RocksDBStore.cc:10:24: fatal error: rocksdb/db.h: No such file or directory
>> compilation terminated.
>>
>> Thanks,
>> Sushma
>>
>> -----Original Message-----
>> From: [email protected]
>> [mailto:[email protected]] On Behalf Of Shu, Xinxin
>> Sent: Monday, June 09, 2014 10:00 PM
>> To: Mark Nelson; Sage Weil
>> Cc: [email protected]; Zhang, Jian
>> Subject: RE: [RFC] add rocksdb support
>>
>> Hi Mark,
>>
>> I have finished the rocksdb submodule support. A pull request adding
>> autoconf/automake support to rocksdb has been created; you can find it
>> at https://github.com/ceph/rocksdb/pull/2. If that patch is OK, I will
>> create a pull request for the rocksdb submodule support; for now the
>> patch can be found at https://github.com/xinxinsh/ceph/tree/wip-rocksdb.
>>
>> -----Original Message-----
>> From: [email protected]
>> [mailto:[email protected]] On Behalf Of Mark Nelson
>> Sent: Tuesday, June 10, 2014 1:12 AM
>> To: Shu, Xinxin; Sage Weil
>> Cc: [email protected]; Zhang, Jian
>> Subject: Re: [RFC] add rocksdb support
>>
>> Hi Xinxin,
>>
>> On 05/28/2014 05:05 AM, Shu, Xinxin wrote:
>>> Hi Sage,
>>> I will add two configure options, --with-librocksdb-static and
>>> --with-librocksdb. With --with-librocksdb-static, ceph will compile
>>> the rocksdb code that comes with the ceph repository; with
>>> --with-librocksdb, for the case where a distro package provides
>>> rocksdb, ceph will not compile the rocksdb code and will use the
>>> pre-installed library instead. Is that OK for you?
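>>>
>>> In other words, the two modes would be used roughly like this (option
>>> names as proposed above):
>>>
>>>     ./autogen.sh
>>>     ./configure --with-librocksdb-static   # build the bundled src/rocksdb
>>>     # or, with a distro-provided librocksdb already installed:
>>>     ./configure --with-librocksdb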
>>>
>>> Since the current rocksdb does not support autoconf/automake, I will
>>> add autoconf/automake support for rocksdb, but before that I think we
>>> should fork a stable branch (maybe 3.0) for ceph.
>>
>> I'm looking at testing out the rocksdb support as well, both for the OSD and 
>> for the monitor based on some issues we've been seeing lately.  Any news on 
>> the 3.0 fork and autoconf/automake support in rocksdb?
>>
>> Thanks,
>> Mark
>>
>>>
>>> -----Original Message-----
>>> From: Mark Nelson [mailto:[email protected]]
>>> Sent: Wednesday, May 21, 2014 9:06 PM
>>> To: Shu, Xinxin; Sage Weil
>>> Cc: [email protected]; Zhang, Jian
>>> Subject: Re: [RFC] add rocksdb support
>>>
>>> On 05/21/2014 07:54 AM, Shu, Xinxin wrote:
>>>> Hi, sage
>>>>
>>>> I will add the rocksdb submodule to the makefile. Currently we want
>>>> to run full performance tests on the key-value DB backends, both
>>>> leveldb and rocksdb, and then optimize rocksdb performance.
>>>
>>> I'm definitely interested in any performance tests you do here.  Last
>>> winter I started doing some fairly high-level tests on raw
>>> leveldb/hyperleveldb/riak leveldb.  I'm very interested in what you see
>>> with rocksdb as a backend.
>>>
>>>>
>>>> -----Original Message-----
>>>> From: Sage Weil [mailto:[email protected]]
>>>> Sent: Wednesday, May 21, 2014 9:19 AM
>>>> To: Shu, Xinxin
>>>> Cc: [email protected]
>>>> Subject: Re: [RFC] add rocksdb support
>>>>
>>>> Hi Xinxin,
>>>>
>>>> I've pushed an updated wip-rocksdb to github/liewegas/ceph.git that 
>>>> includes the latest set of patches with the groundwork and your rocksdb 
>>>> patch.  There is also a commit that adds rocksdb as a git submodule.  I'm 
>>>> thinking that, since there aren't any distro packages for rocksdb at this 
>>>> point, this is going to be the easiest way to make this usable for people.
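>>>>
>>>> For reference, the submodule itself boils down to a .gitmodules entry
>>>> along these lines (path and URL shown here for illustration):
>>>>
>>>>     [submodule "src/rocksdb"]
>>>>         path = src/rocksdb
>>>>         url = https://github.com/ceph/rocksdb.git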
>>>>
>>>> If you can wire the submodule into the makefile, we can merge this in
>>>> so that rocksdb support is in the ceph.com packages.  I suspect that
>>>> the distros will prefer to turn this off in favor of separate shared
>>>> libs, but they can do that at their option if/when they include
>>>> rocksdb in the distro.  I think the key is just to have both
>>>> --with-librocksdb and --with-librocksdb-static (or similar) options so
>>>> that you can use either the statically or dynamically linked one.
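>>>>
>>>> On the configure side, a sketch of how such an option could be
>>>> declared in configure.ac (names illustrative, not the final patch):
>>>>
>>>>     AC_ARG_WITH([librocksdb-static],
>>>>       [AS_HELP_STRING([--with-librocksdb-static],
>>>>         [build and statically link the bundled rocksdb])],
>>>>       [with_librocksdb_static=$withval],
>>>>       [with_librocksdb_static=no])
>>>>     AM_CONDITIONAL([WITH_LIBROCKSDB_STATIC],
>>>>       [test "x$with_librocksdb_static" = xyes])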
>>>>
>>>> Has your group done further testing with rocksdb?  Anything interesting to 
>>>> share?
>>>>
>>>> Thanks!
>>>> sage
>>>>
>>>
>>
>
>
>
> --
> Best Regards,
>
> Wheat



-- 
Best Regards,

Wheat