On Tue, Jul 1, 2014 at 11:15 PM, Sushma Gurram <[email protected]> wrote:
> Haomai,
>
> Is there any write-up on the KeyValueStore header cache and strip size? Based on
> what you stated, it appears that strip size improves performance with large
> object sizes. How would the header cache impact 4KB object sizes?
Hmm, I think we need to challenge that assumption first. I don't think a 4KB
object size is a good size for either FileStore or KeyValueStore. Even with 4KB
objects, the main bottleneck for FileStore will be the "File" itself; for
KeyValueStore it may be more complex. I agree that "header_lock" could be a
problem.

> We'd like to guesstimate the improvement due to strip size and header cache.
> I'm not sure about the header cache implementation yet, but fdcache had
> serialization issues and there was a sharded fdcache to address this (under
> review, I guess).

Yes, fdcache has several problems, not only with concurrent operations but also
with large cache sizes. That is why I introduced RandomCache to avoid them.

>
> I believe the header_lock serialization exists in all ceph branches so far,
> including the master.

Yes, I don't dispute the "header_lock" issue. My question is whether the branch
you evaluated has the DBObjectMap header cache; with the header cache enabled,
is "header_lock" still such a painful point? The same applies to KeyValueStore;
I will try to check it as well.

>
> Thanks,
> Sushma
>
> -----Original Message-----
> From: Haomai Wang [mailto:[email protected]]
> Sent: Tuesday, July 01, 2014 1:06 AM
> To: Somnath Roy
> Cc: Sushma Gurram; Shu, Xinxin; Mark Nelson; Sage Weil; Zhang, Jian;
> [email protected]
> Subject: Re: [RFC] add rocksdb support
>
> Hi,
>
> I don't see why OSD capacity would be at the PB level. Actually, most use
> cases should be several TBs (1-4TB). As for cache hits, it depends entirely
> on the IO characteristics. In my opinion, the header cache in KeyValueStore
> can achieve a high hit rate if the object size and strip size (KeyValueStore)
> are configured properly.
>
> But I'm also interested in your lock comments; which ceph version did you
> evaluate that showed the serialization issue?
>
> On Tue, Jul 1, 2014 at 3:13 PM, Somnath Roy <[email protected]> wrote:
>> Hi Haomai,
>> But the cache hit rate will be minimal or zero if the actual storage per
>> node is very large (say at the PB level). So it will mostly be hitting omap,
>> won't it?
>> How is this header cache going to resolve the serialization issue then?
>>
>> Thanks & Regards
>> Somnath
>>
>> -----Original Message-----
>> From: [email protected]
>> [mailto:[email protected]] On Behalf Of Haomai Wang
>> Sent: Monday, June 30, 2014 11:10 PM
>> To: Sushma Gurram
>> Cc: Shu, Xinxin; Mark Nelson; Sage Weil; Zhang, Jian;
>> [email protected]
>> Subject: Re: [RFC] add rocksdb support
>>
>> Hi Sushma,
>>
>> Thanks for your investigations! We had already noticed the serialization
>> risk in GenericObjectMap/DBObjectMap. In order to improve performance we
>> added a header cache to DBObjectMap.
>>
>> As for KeyValueStore, a cache branch is under review; it can greatly reduce
>> lookup_header calls. Of course, replacing the lock with a RWLock is a good
>> suggestion, and I would like to try to evaluate it!
>>
>> On Tue, Jul 1, 2014 at 8:39 AM, Sushma Gurram <[email protected]>
>> wrote:
>>> Hi Haomai/Greg,
>>>
>>> I tried to analyze this a bit more and it appears that
>>> GenericObjectMap::header_lock is serializing the READ requests in the
>>> following path, and hence the low performance numbers with KeyValueStore.
>>> ReplicatedPG::do_op() -> ReplicatedPG::find_object_context() ->
>>> ReplicatedPG::get_object_context() -> PGBackend::objects_get_attr() ->
>>> KeyValueStore::getattr() -> GenericObjectMap::get_values() ->
>>> GenericObjectMap::lookup_header()
>>>
>>> I modified the code to bypass this lock for a specific run and noticed
>>> that the performance is similar to FileStore.
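
To make the RWLock idea concrete, below is a minimal sketch of the kind of
change being discussed: take a shared lock on the pure in-memory read path (the
lookup_header()-style call) and an exclusive lock only when headers are created
or removed. The class and member names are hypothetical stand-ins, not the
actual GenericObjectMap code, and std::shared_mutex is used only for brevity
(Ceph has its own RWLock wrapper). A real patch would still need the exclusive
path for cache misses that touch the backing key/value store.

    #include <cstdint>
    #include <map>
    #include <memory>
    #include <shared_mutex>
    #include <string>

    struct Header {
      uint64_t seq = 0;   // stand-in for the real header payload
    };

    // Hypothetical header map guarded by a reader/writer lock instead of a
    // plain mutex, so concurrent getattr()-style readers don't serialize.
    class HeaderMap {
      std::shared_mutex header_lock;
      std::map<std::string, std::shared_ptr<Header>> headers;

    public:
      // Read path: shared lock, many readers may look up headers at once.
      std::shared_ptr<Header> lookup_header(const std::string& oid) {
        std::shared_lock<std::shared_mutex> l(header_lock);
        auto it = headers.find(oid);
        return it == headers.end() ? nullptr : it->second;
      }

      // Write path: exclusive lock while the map itself is modified.
      std::shared_ptr<Header> create_header(const std::string& oid) {
        std::unique_lock<std::shared_mutex> l(header_lock);
        auto& h = headers[oid];
        if (!h)
          h = std::make_shared<Header>();
        return h;
      }
    };

The point is only that read-mostly lookups need not all funnel through one
mutex; whether that is safe for the real header lifetime and refcounting is
exactly the question being asked here.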
>>>
>>> In our earlier investigations we also noticed similar serialization issues
>>> with DBObjectMap::header_lock when rgw stores xattrs in LevelDB.
>>>
>>> Can you please help us understand the reason for this lock, and whether it
>>> can be replaced with a RWLock? Any other suggestions to avoid serialization
>>> due to this lock are also welcome.
>>>
>>> Thanks,
>>> Sushma
>>>
>>> -----Original Message-----
>>> From: Haomai Wang [mailto:[email protected]]
>>> Sent: Friday, June 27, 2014 1:08 AM
>>> To: Sushma Gurram
>>> Cc: Shu, Xinxin; Mark Nelson; Sage Weil; Zhang, Jian;
>>> [email protected]
>>> Subject: Re: [RFC] add rocksdb support
>>>
>>> As I mentioned days ago, there are two points related to kvstore perf:
>>> 1. The order of the image and the strip size are important to performance.
>>> Because a header, like an inode in a fs, is much more lightweight than an
>>> fd, the order of the image is expected to be lower. And the strip size can
>>> be configured to 4KB to improve large-IO performance.
>>> 2. The header cache (https://github.com/ceph/ceph/pull/1649) is not merged
>>> yet, and the header cache is important to perf. It's just like fdcache in
>>> FileStore.
>>>
>>> As for the detailed perf numbers, I think this result based on the master
>>> branch is roughly correct. When strip size and the header cache are ready,
>>> I think it will be better.
>>>
>>> On Fri, Jun 27, 2014 at 8:44 AM, Sushma Gurram <[email protected]>
>>> wrote:
>>>> Delivery failure due to table format. Resending as plain text.
>>>>
>>>> _____________________________________________
>>>> From: Sushma Gurram
>>>> Sent: Thursday, June 26, 2014 5:35 PM
>>>> To: 'Shu, Xinxin'; 'Mark Nelson'; 'Sage Weil'
>>>> Cc: 'Zhang, Jian'; [email protected]
>>>> Subject: RE: [RFC] add rocksdb support
>>>>
>>>> Hi Xinxin,
>>>>
>>>> Thanks for providing the results of the performance tests.
>>>>
>>>> I used fio (with support for the rbd ioengine) to compare XFS and RocksDB
>>>> with a single OSD, and confirmed with rados bench that both numbers are of
>>>> the same order.
>>>> My findings show that XFS is better than rocksdb. Can you please let us
>>>> know the rocksdb configuration that you used, the object size, and the
>>>> duration of the run for rados bench?
>>>> For the random write tests, I see the "rocksdb:bg0" thread as the top CPU
>>>> consumer (%CPU of this thread is 50, while every other thread in the OSD
>>>> is <10% utilized).
>>>> Is there a ceph.conf option to configure the background threads in
>>>> rocksdb?
>>>>
>>>> We ran our tests with the following configuration:
>>>> System: Intel(R) Xeon(R) CPU E5-4620 0 @ 2.20GHz (16 physical cores),
>>>> HT disabled, 16 GB memory
>>>>
>>>> The rocksdb configuration has been set to the following values in
>>>> ceph.conf:
>>>> rocksdb_write_buffer_size = 4194304
>>>> rocksdb_cache_size = 4194304
>>>> rocksdb_bloom_size = 0
>>>> rocksdb_max_open_files = 10240
>>>> rocksdb_compression = false
>>>> rocksdb_paranoid = false
>>>> rocksdb_log = /dev/null
>>>> rocksdb_compact_on_mount = false
>>>>
>>>> fio used the rbd ioengine with numjobs=1 for writes and numjobs=16 for
>>>> reads, iodepth=32. Unlike rados bench, fio rbd helps to create multiple
>>>> (=numjobs) client connections to the OSD, thus stressing the OSD.
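
For reference, the fio setup described just above might look roughly like the
job file below. The pool, image, and client names are placeholders for
illustration; only the block size, queue depth, and numjobs values come from
the description above, and the runtime is assumed.

    [global]
    ; requires fio built with the rbd engine; client/pool/image names below
    ; are placeholders, not the values actually used in these runs
    ioengine=rbd
    clientname=admin
    pool=rbd
    rbdname=fio_test
    bs=4k
    direct=1
    time_based
    runtime=300

    ; reads were run the same way with rw=randread and numjobs=16
    [4k-randwrite]
    rw=randwrite
    iodepth=32
    numjobs=1

Each fio job opens its own RBD client connection, which is why numjobs=16
stresses the single OSD harder than one rados bench client can.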
>>>>
>>>> rbd image size = 2 GB, rocksdb_write_buffer_size = 4MB
>>>> -------------------------------------------------------------------
>>>> IO Pattern      XFS (IOPS)      RocksDB (IOPS)
>>>> 4K writes       ~1450           ~670
>>>> 4K reads        ~65000          ~2000
>>>> 64K writes      ~431            ~57
>>>> 64K reads       ~17500          ~180
>>>>
>>>> rbd image size = 2 GB, rocksdb_write_buffer_size = 1GB
>>>> -------------------------------------------------------------------
>>>> IO Pattern      XFS (IOPS)      RocksDB (IOPS)
>>>> 4K writes       ~1450           ~962
>>>> 4K reads        ~65000          ~1641
>>>> 64K writes      ~431            ~426
>>>> 64K reads       ~17500          ~209
>>>>
>>>> I guess the lower rocksdb write performance can theoretically be
>>>> attributed to compaction during writes and merging during reads, but I'm
>>>> not sure READs should be lower by this magnitude.
>>>> However, your results seem to show otherwise. Can you please help us with
>>>> the rocksdb config and how rados bench has been run?
>>>>
>>>> Thanks,
>>>> Sushma
>>>>
>>>> -----Original Message-----
>>>> From: Shu, Xinxin [mailto:[email protected]]
>>>> Sent: Sunday, June 22, 2014 6:18 PM
>>>> To: Sushma Gurram; 'Mark Nelson'; 'Sage Weil'
>>>> Cc: '[email protected]'; Zhang, Jian
>>>> Subject: RE: [RFC] add rocksdb support
>>>>
>>>> Hi all,
>>>>
>>>> We enabled rocksdb as the data store in our test setup (10 OSDs on two
>>>> servers, each server with 5 HDDs as OSDs, 2 SSDs as journals, Intel(R)
>>>> Xeon(R) CPU E31280) and ran performance tests for xfs, leveldb and rocksdb
>>>> (using rados bench as our test tool). The following chart shows the
>>>> details. For writes, with a small number of threads, leveldb performance
>>>> is lower than the other two backends; from 16 threads on, rocksdb performs
>>>> a little better than xfs and leveldb, and leveldb and rocksdb perform much
>>>> better than xfs at higher thread counts.
>>>>
>>>>                     xfs                  leveldb              rocksdb
>>>>                     throughput latency   throughput latency   throughput latency
>>>> 1 thread write      84.029     0.048     52.430     0.076     71.920     0.056
>>>> 2 threads write     166.417    0.048     97.917     0.082     155.148    0.052
>>>> 4 threads write     304.099    0.052     156.094    0.102     270.461    0.059
>>>> 8 threads write     323.047    0.099     221.370    0.144     339.455    0.094
>>>> 16 threads write    295.040    0.216     272.032    0.235     348.849    0.183
>>>> 32 threads write    324.467    0.394     290.072    0.441     338.103    0.378
>>>> 64 threads write    313.713    0.812     293.261    0.871     324.603    0.787
>>>> 1 thread read       75.687     0.053     71.629     0.056     72.526     0.055
>>>> 2 threads read      182.329    0.044     151.683    0.053     153.125    0.052
>>>> 4 threads read      320.785    0.050     307.180    0.052     312.016    0.051
>>>> 8 threads read      504.880    0.063     512.295    0.062     519.683    0.062
>>>> 16 threads read     477.706    0.134     643.385    0.099     654.149    0.098
>>>> 32 threads read     517.670    0.247     666.696    0.192     678.480    0.189
>>>> 64 threads read     516.599    0.495     668.360    0.383     680.673    0.376
>>>>
>>>> -----Original Message-----
>>>> From: Shu, Xinxin
>>>> Sent: Saturday, June 14, 2014 11:50 AM
>>>> To: Sushma Gurram; Mark Nelson; Sage Weil
>>>> Cc: [email protected]; Zhang, Jian
>>>> Subject: RE: [RFC] add rocksdb support
>>>>
>>>> Currently ceph gets a stable rocksdb from the 3.0.fb branch of
>>>> ceph/rocksdb. Since PR https://github.com/ceph/rocksdb/pull/2 has not been
>>>> merged yet, if you use 'git submodule update --init' to get the rocksdb
>>>> submodule, it does not support autoconf/automake.
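
Since the question of exactly how rados bench was run comes up above, the
thread sweep shown in the table further up would typically be driven by
something like the following. The pool name and runtime are placeholders; the
actual parameters used in these runs were not given.

    # 16-thread write phase, keeping the objects around for the read phase
    rados -p testpool bench 60 write -t 16 --no-cleanup

    # 16-thread sequential read phase over the objects written above
    rados -p testpool bench 60 seq -t 16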
>>>>
>>>> -----Original Message-----
>>>> From: [email protected]
>>>> [mailto:[email protected]] On Behalf Of Sushma Gurram
>>>> Sent: Saturday, June 14, 2014 2:52 AM
>>>> To: Shu, Xinxin; Mark Nelson; Sage Weil
>>>> Cc: [email protected]; Zhang, Jian
>>>> Subject: RE: [RFC] add rocksdb support
>>>>
>>>> Hi Xinxin,
>>>>
>>>> I tried to compile the wip-rocksdb branch, but the src/rocksdb directory
>>>> seems to be empty. Do I need to put autoconf/automake in this directory?
>>>> It doesn't seem to have any other source files, and compilation fails:
>>>> os/RocksDBStore.cc:10:24: fatal error: rocksdb/db.h: No such file or
>>>> directory
>>>> compilation terminated.
>>>>
>>>> Thanks,
>>>> Sushma
>>>>
>>>> -----Original Message-----
>>>> From: [email protected]
>>>> [mailto:[email protected]] On Behalf Of Shu, Xinxin
>>>> Sent: Monday, June 09, 2014 10:00 PM
>>>> To: Mark Nelson; Sage Weil
>>>> Cc: [email protected]; Zhang, Jian
>>>> Subject: RE: [RFC] add rocksdb support
>>>>
>>>> Hi Mark,
>>>>
>>>> I have finished development of the rocksdb submodule support. A pull
>>>> request adding autoconf/automake support to rocksdb has been created; you
>>>> can find it at https://github.com/ceph/rocksdb/pull/2 . If this patch is
>>>> OK, I will create a pull request for the rocksdb submodule support;
>>>> currently that patch can be found at
>>>> https://github.com/xinxinsh/ceph/tree/wip-rocksdb .
>>>>
>>>> -----Original Message-----
>>>> From: [email protected]
>>>> [mailto:[email protected]] On Behalf Of Mark Nelson
>>>> Sent: Tuesday, June 10, 2014 1:12 AM
>>>> To: Shu, Xinxin; Sage Weil
>>>> Cc: [email protected]; Zhang, Jian
>>>> Subject: Re: [RFC] add rocksdb support
>>>>
>>>> Hi Xinxin,
>>>>
>>>> On 05/28/2014 05:05 AM, Shu, Xinxin wrote:
>>>>> Hi Sage,
>>>>> I will add two configure options, --with-librocksdb-static and
>>>>> --with-librocksdb. With the --with-librocksdb-static option, ceph will
>>>>> compile the rocksdb code that it gets from the ceph repository; with the
>>>>> --with-librocksdb option, in case of distro packages for rocksdb, ceph
>>>>> will not compile the rocksdb code and will use the pre-installed library.
>>>>> Is that OK for you?
>>>>>
>>>>> Since current rocksdb does not support autoconf/automake, I will add
>>>>> autoconf/automake support for rocksdb, but before that, I think we should
>>>>> fork a stable branch (maybe 3.0) for ceph.
>>>>
>>>> I'm looking at testing out the rocksdb support as well, both for the OSD
>>>> and for the monitor, based on some issues we've been seeing lately. Any
>>>> news on the 3.0 fork and autoconf/automake support in rocksdb?
>>>>
>>>> Thanks,
>>>> Mark
>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: Mark Nelson [mailto:[email protected]]
>>>>> Sent: Wednesday, May 21, 2014 9:06 PM
>>>>> To: Shu, Xinxin; Sage Weil
>>>>> Cc: [email protected]; Zhang, Jian
>>>>> Subject: Re: [RFC] add rocksdb support
>>>>>
>>>>> On 05/21/2014 07:54 AM, Shu, Xinxin wrote:
>>>>>> Hi Sage,
>>>>>>
>>>>>> I will add the rocksdb submodule into the makefile. Currently we want to
>>>>>> run full performance tests on the key-value db backends, both leveldb
>>>>>> and rocksdb, and then optimize rocksdb performance.
>>>>>
>>>>> I'm definitely interested in any performance tests you do here. Last
>>>>> winter I started doing some fairly high-level tests on raw
>>>>> leveldb/hyperleveldb/riak leveldb. I'm very interested in what you see
>>>>> with rocksdb as a backend.
>>>>>
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Sage Weil [mailto:[email protected]]
>>>>>> Sent: Wednesday, May 21, 2014 9:19 AM
>>>>>> To: Shu, Xinxin
>>>>>> Cc: [email protected]
>>>>>> Subject: Re: [RFC] add rocksdb support
>>>>>>
>>>>>> Hi Xinxin,
>>>>>>
>>>>>> I've pushed an updated wip-rocksdb to github/liewegas/ceph.git that
>>>>>> includes the latest set of patches with the groundwork and your rocksdb
>>>>>> patch. There is also a commit that adds rocksdb as a git submodule.
>>>>>> I'm thinking that, since there aren't any distro packages for rocksdb at
>>>>>> this point, this is going to be the easiest way to make this usable for
>>>>>> people.
>>>>>>
>>>>>> If you can wire the submodule into the makefile, we can merge this in so
>>>>>> that rocksdb support is in the ceph.com packages. I suspect that the
>>>>>> distros will prefer to turn this off in favor of separate shared libs,
>>>>>> but they can do that at their option if/when they include rocksdb in the
>>>>>> distro. I think the key is just to have both --with-librocksdb and
>>>>>> --with-librocksdb-static (or similar) options so that you can use either
>>>>>> the statically or dynamically linked one.
>>>>>>
>>>>>> Has your group done further testing with rocksdb? Anything interesting
>>>>>> to share?
>>>>>>
>>>>>> Thanks!
>>>>>> sage
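
In other words, once the submodule is wired into the build, the two modes
proposed above would be selected roughly like this. The option names are the
ones discussed in this thread; the final spelling was still being settled, so
treat this only as a sketch of the intent.

    # build and statically link the rocksdb sources bundled as a git submodule
    ./autogen.sh && ./configure --with-librocksdb-static

    # or link against a pre-installed (distro-packaged) librocksdb
    ./autogen.sh && ./configure --with-librocksdb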
--
Best Regards,

Wheat
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel"
in the body of a message to [email protected]
More majordomo info at http://vger.kernel.org/majordomo-info.html
