On Tue, Jul 1, 2014 at 11:15 PM, Sushma Gurram <[email protected]> wrote:
> Haomai,
>
> Is there any write-up on the KeyValueStore header cache and strip size? Based on
> what you stated, it appears that strip size improves performance with large
> object sizes. How would the header cache impact 4KB object sizes?
Hmm, I think we need to challenge that assumption first. I don't think a 4KB
object size is a good size for either FileStore or KeyValueStore. Even with 4KB
objects, the main bottleneck for FileStore will be the "File" itself; for
KeyValueStore it may be more complex. I agree that "header_lock" could be a
problem.

> We'd like to guesstimate the improvement due to strip size and header cache.
> I'm not sure about the header cache implementation yet, but fdcache had
> serialization issues and there was a sharded fdcache to address this (under
> review, I guess).

Yes, fdcache has several problems, not only with concurrent operations but also
with large cache sizes. That is why I introduced RandomCache to avoid them.

>
> I believe the header_lock serialization exists in all ceph branches so far,
> including the master.

Yes, I don't dispute the "header_lock" issue. My question is whether the branch
you evaluated has the DBObjectMap header cache; with the header cache enabled,
is "header_lock" still such a painful point? The same applies to KeyValueStore;
I will try to check it as well.

>
> Thanks,
> Sushma
>
> -----Original Message-----
> From: Haomai Wang [mailto:[email protected]]
> Sent: Tuesday, July 01, 2014 1:06 AM
> To: Somnath Roy
> Cc: Sushma Gurram; Shu, Xinxin; Mark Nelson; Sage Weil; Zhang, Jian;
> [email protected]
> Subject: Re: [RFC] add rocksdb support
>
> Hi,
>
> I don't see why OSD capacity would be at the PB level. Actually, most use
> cases should be several TBs (1-4TB). As for cache hits, it depends entirely
> on the IO characteristics. In my opinion, the header cache in KeyValueStore
> can achieve a high hit rate if the object size and strip size (KeyValueStore)
> are configured properly.
>
> But I'm also interested in your lock comments; which ceph version did you
> evaluate that showed the serialization issue?
>
> On Tue, Jul 1, 2014 at 3:13 PM, Somnath Roy <[email protected]> wrote:
>> Hi Haomai,
>> But the cache hit rate will be minimal or zero if the actual storage per
>> node is very large (say at the PB level). So it will mostly be hitting omap,
>> won't it?
>> How is this header cache going to resolve the serialization issue then?
>>
>> Thanks & Regards
>> Somnath
>>
>> -----Original Message-----
>> From: [email protected]
>> [mailto:[email protected]] On Behalf Of Haomai Wang
>> Sent: Monday, June 30, 2014 11:10 PM
>> To: Sushma Gurram
>> Cc: Shu, Xinxin; Mark Nelson; Sage Weil; Zhang, Jian;
>> [email protected]
>> Subject: Re: [RFC] add rocksdb support
>>
>> Hi Sushma,
>>
>> Thanks for your investigations! We had already noticed the serialization
>> risk in GenericObjectMap/DBObjectMap. In order to improve performance we
>> added a header cache to DBObjectMap.
>>
>> As for KeyValueStore, a cache branch is under review; it can greatly reduce
>> lookup_header calls. Of course, replacing the lock with a RWLock is a good
>> suggestion, and I would like to try to evaluate it!
>>
>> On Tue, Jul 1, 2014 at 8:39 AM, Sushma Gurram <[email protected]>
>> wrote:
>>> Hi Haomai/Greg,
>>>
>>> I tried to analyze this a bit more and it appears that
>>> GenericObjectMap::header_lock is serializing the READ requests in the
>>> following path, and hence the low performance numbers with KeyValueStore.
>>> ReplicatedPG::do_op() -> ReplicatedPG::find_object_context() ->
>>> ReplicatedPG::get_object_context() -> PGBackend::objects_get_attr() ->
>>> KeyValueStore::getattr() -> GenericObjectMap::get_values() ->
>>> GenericObjectMap::lookup_header()
>>>
>>> I modified the code to bypass this lock for a specific run and noticed
>>> that the performance is similar to FileStore.
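
To make the RWLock idea concrete, below is a minimal sketch of the kind of
change being discussed: take a shared lock on the pure in-memory read path (the
lookup_header()-style call) and an exclusive lock only when headers are created
or removed. The class and member names are hypothetical stand-ins, not the
actual GenericObjectMap code, and std::shared_mutex is used only for brevity
(Ceph has its own RWLock wrapper). A real patch would still need the exclusive
path for cache misses that touch the backing key/value store.

    #include <cstdint>
    #include <map>
    #include <memory>
    #include <shared_mutex>
    #include <string>

    struct Header {
      uint64_t seq = 0;   // stand-in for the real header payload
    };

    // Hypothetical header map guarded by a reader/writer lock instead of a
    // plain mutex, so concurrent getattr()-style readers don't serialize.
    class HeaderMap {
      std::shared_mutex header_lock;
      std::map<std::string, std::shared_ptr<Header>> headers;

    public:
      // Read path: shared lock, many readers may look up headers at once.
      std::shared_ptr<Header> lookup_header(const std::string& oid) {
        std::shared_lock<std::shared_mutex> l(header_lock);
        auto it = headers.find(oid);
        return it == headers.end() ? nullptr : it->second;
      }

      // Write path: exclusive lock while the map itself is modified.
      std::shared_ptr<Header> create_header(const std::string& oid) {
        std::unique_lock<std::shared_mutex> l(header_lock);
        auto& h = headers[oid];
        if (!h)
          h = std::make_shared<Header>();
        return h;
      }
    };

The point is only that read-mostly lookups need not all funnel through one
mutex; whether that is safe for the real header lifetime and refcounting is
exactly the question being asked here.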
>>>
>>> In our earlier investigations we also noticed similar serialization issues
>>> with DBObjectMap::header_lock when rgw stores xattrs in LevelDB.
>>>
>>> Can you please help us understand the reason for this lock, and whether it
>>> can be replaced with a RWLock? Any other suggestions to avoid serialization
>>> due to this lock are also welcome.
>>>
>>> Thanks,
>>> Sushma
>>>
>>> -----Original Message-----
>>> From: Haomai Wang [mailto:[email protected]]
>>> Sent: Friday, June 27, 2014 1:08 AM
>>> To: Sushma Gurram
>>> Cc: Shu, Xinxin; Mark Nelson; Sage Weil; Zhang, Jian;
>>> [email protected]
>>> Subject: Re: [RFC] add rocksdb support
>>>
>>> As I mentioned days ago, there are two points related to kvstore perf:
>>> 1. The order of the image and the strip size are important to performance.
>>> Because a header, like an inode in a fs, is much more lightweight than an
>>> fd, the order of the image is expected to be lower. And the strip size can
>>> be configured to 4KB to improve large-IO performance.
>>> 2. The header cache (https://github.com/ceph/ceph/pull/1649) is not merged
>>> yet, and the header cache is important to perf. It's just like fdcache in
>>> FileStore.
>>>
>>> As for the detailed perf numbers, I think this result based on the master
>>> branch is roughly correct. When strip size and the header cache are ready,
>>> I think it will be better.
>>>
>>> On Fri, Jun 27, 2014 at 8:44 AM, Sushma Gurram <[email protected]>
>>> wrote:
>>>> Delivery failure due to table format. Resending as plain text.
>>>>
>>>> _____________________________________________
>>>> From: Sushma Gurram
>>>> Sent: Thursday, June 26, 2014 5:35 PM
>>>> To: 'Shu, Xinxin'; 'Mark Nelson'; 'Sage Weil'
>>>> Cc: 'Zhang, Jian'; [email protected]
>>>> Subject: RE: [RFC] add rocksdb support
>>>>
>>>> Hi Xinxin,
>>>>
>>>> Thanks for providing the results of the performance tests.
>>>>
>>>> I used fio (with support for the rbd ioengine) to compare XFS and RocksDB
>>>> with a single OSD, and confirmed with rados bench that both numbers are of
>>>> the same order.
>>>> My findings show that XFS is better than rocksdb. Can you please let us
>>>> know the rocksdb configuration that you used, the object size, and the
>>>> duration of the run for rados bench?
>>>> For the random write tests, I see the "rocksdb:bg0" thread as the top CPU
>>>> consumer (%CPU of this thread is 50, while every other thread in the OSD
>>>> is <10% utilized).
>>>> Is there a ceph.conf option to configure the background threads in
>>>> rocksdb?
>>>>
>>>> We ran our tests with the following configuration:
>>>> System: Intel(R) Xeon(R) CPU E5-4620 0 @ 2.20GHz (16 physical cores),
>>>> HT disabled, 16 GB memory
>>>>
>>>> The rocksdb configuration has been set to the following values in
>>>> ceph.conf:
>>>> rocksdb_write_buffer_size = 4194304
>>>> rocksdb_cache_size = 4194304
>>>> rocksdb_bloom_size = 0
>>>> rocksdb_max_open_files = 10240
>>>> rocksdb_compression = false
>>>> rocksdb_paranoid = false
>>>> rocksdb_log = /dev/null
>>>> rocksdb_compact_on_mount = false
>>>>
>>>> fio used the rbd ioengine with numjobs=1 for writes and numjobs=16 for
>>>> reads, iodepth=32. Unlike rados bench, fio rbd helps to create multiple
>>>> (=numjobs) client connections to the OSD, thus stressing the OSD.
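
For reference, the fio setup described just above might look roughly like the
job file below. The pool, image, and client names are placeholders for
illustration; only the block size, queue depth, and numjobs values come from
the description above, and the runtime is assumed.

    [global]
    ; requires fio built with the rbd engine; client/pool/image names below
    ; are placeholders, not the values actually used in these runs
    ioengine=rbd
    clientname=admin
    pool=rbd
    rbdname=fio_test
    bs=4k
    direct=1
    time_based
    runtime=300

    ; reads were run the same way with rw=randread and numjobs=16
    [4k-randwrite]
    rw=randwrite
    iodepth=32
    numjobs=1

Each fio job opens its own RBD client connection, which is why numjobs=16
stresses the single OSD harder than one rados bench client can.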
>>>>
>>>> rbd image size = 2 GB, rocksdb_write_buffer_size = 4MB
>>>> -------------------------------------------------------------------
>>>> IO Pattern      XFS (IOPS)      RocksDB (IOPS)
>>>> 4K writes       ~1450           ~670
>>>> 4K reads        ~65000          ~2000
>>>> 64K writes      ~431            ~57
>>>> 64K reads       ~17500          ~180
>>>>
>>>> rbd image size = 2 GB, rocksdb_write_buffer_size = 1GB
>>>> -------------------------------------------------------------------
>>>> IO Pattern      XFS (IOPS)      RocksDB (IOPS)
>>>> 4K writes       ~1450           ~962
>>>> 4K reads        ~65000          ~1641
>>>> 64K writes      ~431            ~426
>>>> 64K reads       ~17500          ~209
>>>>
>>>> I guess the lower rocksdb write performance can theoretically be
>>>> attributed to compaction during writes and merging during reads, but I'm
>>>> not sure READs should be lower by this magnitude.
>>>> However, your results seem to show otherwise. Can you please help us with
>>>> the rocksdb config and how rados bench has been run?
>>>>
>>>> Thanks,
>>>> Sushma
>>>>
>>>> -----Original Message-----
>>>> From: Shu, Xinxin [mailto:[email protected]]
>>>> Sent: Sunday, June 22, 2014 6:18 PM
>>>> To: Sushma Gurram; 'Mark Nelson'; 'Sage Weil'
>>>> Cc: '[email protected]'; Zhang, Jian
>>>> Subject: RE: [RFC] add rocksdb support
>>>>
>>>> Hi all,
>>>>
>>>> We enabled rocksdb as the data store in our test setup (10 OSDs on two
>>>> servers, each server with 5 HDDs as OSDs, 2 SSDs as journals, Intel(R)
>>>> Xeon(R) CPU E31280) and ran performance tests for xfs, leveldb and rocksdb
>>>> (using rados bench as our test tool). The following chart shows the
>>>> details. For writes, with a small number of threads, leveldb performance
>>>> is lower than the other two backends; from 16 threads on, rocksdb performs
>>>> a little better than xfs and leveldb, and leveldb and rocksdb perform much
>>>> better than xfs at higher thread counts.
>>>>
>>>>                     xfs                  leveldb              rocksdb
>>>>                     throughput latency   throughput latency   throughput latency
>>>> 1 thread write      84.029     0.048     52.430     0.076     71.920     0.056
>>>> 2 threads write     166.417    0.048     97.917     0.082     155.148    0.052
>>>> 4 threads write     304.099    0.052     156.094    0.102     270.461    0.059
>>>> 8 threads write     323.047    0.099     221.370    0.144     339.455    0.094
>>>> 16 threads write    295.040    0.216     272.032    0.235     348.849    0.183
>>>> 32 threads write    324.467    0.394     290.072    0.441     338.103    0.378
>>>> 64 threads write    313.713    0.812     293.261    0.871     324.603    0.787
>>>> 1 thread read       75.687     0.053     71.629     0.056     72.526     0.055
>>>> 2 threads read      182.329    0.044     151.683    0.053     153.125    0.052
>>>> 4 threads read      320.785    0.050     307.180    0.052     312.016    0.051
>>>> 8 threads read      504.880    0.063     512.295    0.062     519.683    0.062
>>>> 16 threads read     477.706    0.134     643.385    0.099     654.149    0.098
>>>> 32 threads read     517.670    0.247     666.696    0.192     678.480    0.189
>>>> 64 threads read     516.599    0.495     668.360    0.383     680.673    0.376
>>>>
>>>> -----Original Message-----
>>>> From: Shu, Xinxin
>>>> Sent: Saturday, June 14, 2014 11:50 AM
>>>> To: Sushma Gurram; Mark Nelson; Sage Weil
>>>> Cc: [email protected]; Zhang, Jian
>>>> Subject: RE: [RFC] add rocksdb support
>>>>
>>>> Currently ceph gets a stable rocksdb from the 3.0.fb branch of
>>>> ceph/rocksdb. Since PR https://github.com/ceph/rocksdb/pull/2 has not been
>>>> merged yet, if you use 'git submodule update --init' to get the rocksdb
>>>> submodule, it does not support autoconf/automake.
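
Since the question of exactly how rados bench was run comes up above, the
thread sweep shown in the table further up would typically be driven by
something like the following. The pool name and runtime are placeholders; the
actual parameters used in these runs were not given.

    # 16-thread write phase, keeping the objects around for the read phase
    rados -p testpool bench 60 write -t 16 --no-cleanup

    # 16-thread sequential read phase over the objects written above
    rados -p testpool bench 60 seq -t 16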
>>>>
>>>> -----Original Message-----
>>>> From: [email protected]
>>>> [mailto:[email protected]] On Behalf Of Sushma Gurram
>>>> Sent: Saturday, June 14, 2014 2:52 AM
>>>> To: Shu, Xinxin; Mark Nelson; Sage Weil
>>>> Cc: [email protected]; Zhang, Jian
>>>> Subject: RE: [RFC] add rocksdb support
>>>>
>>>> Hi Xinxin,
>>>>
>>>> I tried to compile the wip-rocksdb branch, but the src/rocksdb directory
>>>> seems to be empty. Do I need to put autoconf/automake in this directory?
>>>> It doesn't seem to have any other source files, and compilation fails:
>>>> os/RocksDBStore.cc:10:24: fatal error: rocksdb/db.h: No such file or
>>>> directory
>>>> compilation terminated.
>>>>
>>>> Thanks,
>>>> Sushma
>>>>
>>>> -----Original Message-----
>>>> From: [email protected]
>>>> [mailto:[email protected]] On Behalf Of Shu, Xinxin
>>>> Sent: Monday, June 09, 2014 10:00 PM
>>>> To: Mark Nelson; Sage Weil
>>>> Cc: [email protected]; Zhang, Jian
>>>> Subject: RE: [RFC] add rocksdb support
>>>>
>>>> Hi Mark,
>>>>
>>>> I have finished development of the rocksdb submodule support. A pull
>>>> request adding autoconf/automake support to rocksdb has been created; you
>>>> can find it at https://github.com/ceph/rocksdb/pull/2 . If this patch is
>>>> OK, I will create a pull request for the rocksdb submodule support;
>>>> currently that patch can be found at
>>>> https://github.com/xinxinsh/ceph/tree/wip-rocksdb .
>>>>
>>>> -----Original Message-----
>>>> From: [email protected]
>>>> [mailto:[email protected]] On Behalf Of Mark Nelson
>>>> Sent: Tuesday, June 10, 2014 1:12 AM
>>>> To: Shu, Xinxin; Sage Weil
>>>> Cc: [email protected]; Zhang, Jian
>>>> Subject: Re: [RFC] add rocksdb support
>>>>
>>>> Hi Xinxin,
>>>>
>>>> On 05/28/2014 05:05 AM, Shu, Xinxin wrote:
>>>>> Hi Sage,
>>>>> I will add two configure options, --with-librocksdb-static and
>>>>> --with-librocksdb. With the --with-librocksdb-static option, ceph will
>>>>> compile the rocksdb code that it gets from the ceph repository; with the
>>>>> --with-librocksdb option, in case of distro packages for rocksdb, ceph
>>>>> will not compile the rocksdb code and will use the pre-installed library.
>>>>> Is that OK for you?
>>>>>
>>>>> Since current rocksdb does not support autoconf/automake, I will add
>>>>> autoconf/automake support for rocksdb, but before that, I think we should
>>>>> fork a stable branch (maybe 3.0) for ceph.
>>>>
>>>> I'm looking at testing out the rocksdb support as well, both for the OSD
>>>> and for the monitor, based on some issues we've been seeing lately. Any
>>>> news on the 3.0 fork and autoconf/automake support in rocksdb?
>>>>
>>>> Thanks,
>>>> Mark
>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: Mark Nelson [mailto:[email protected]]
>>>>> Sent: Wednesday, May 21, 2014 9:06 PM
>>>>> To: Shu, Xinxin; Sage Weil
>>>>> Cc: [email protected]; Zhang, Jian
>>>>> Subject: Re: [RFC] add rocksdb support
>>>>>
>>>>> On 05/21/2014 07:54 AM, Shu, Xinxin wrote:
>>>>>> Hi Sage,
>>>>>>
>>>>>> I will add the rocksdb submodule into the makefile. Currently we want to
>>>>>> run full performance tests on the key-value db backends, both leveldb
>>>>>> and rocksdb, and then optimize rocksdb performance.
>>>>>
>>>>> I'm definitely interested in any performance tests you do here. Last
>>>>> winter I started doing some fairly high-level tests on raw
>>>>> leveldb/hyperleveldb/riak leveldb. I'm very interested in what you see
>>>>> with rocksdb as a backend.
>>>>>
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Sage Weil [mailto:[email protected]]
>>>>>> Sent: Wednesday, May 21, 2014 9:19 AM
>>>>>> To: Shu, Xinxin
>>>>>> Cc: [email protected]
>>>>>> Subject: Re: [RFC] add rocksdb support
>>>>>>
>>>>>> Hi Xinxin,
>>>>>>
>>>>>> I've pushed an updated wip-rocksdb to github/liewegas/ceph.git that
>>>>>> includes the latest set of patches with the groundwork and your rocksdb
>>>>>> patch. There is also a commit that adds rocksdb as a git submodule.
>>>>>> I'm thinking that, since there aren't any distro packages for rocksdb at
>>>>>> this point, this is going to be the easiest way to make this usable for
>>>>>> people.
>>>>>>
>>>>>> If you can wire the submodule into the makefile, we can merge this in so
>>>>>> that rocksdb support is in the ceph.com packages. I suspect that the
>>>>>> distros will prefer to turn this off in favor of separate shared libs,
>>>>>> but they can do that at their option if/when they include rocksdb in the
>>>>>> distro. I think the key is just to have both --with-librocksdb and
>>>>>> --with-librocksdb-static (or similar) options so that you can use either
>>>>>> the statically or dynamically linked one.
>>>>>>
>>>>>> Has your group done further testing with rocksdb? Anything interesting
>>>>>> to share?
>>>>>>
>>>>>> Thanks!
>>>>>> sage
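
In other words, once the submodule is wired into the build, the two modes
proposed above would be selected roughly like this. The option names are the
ones discussed in this thread; the final spelling was still being settled, so
treat this only as a sketch of the intent.

    # build and statically link the rocksdb sources bundled as a git submodule
    ./autogen.sh && ./configure --with-librocksdb-static

    # or link against a pre-installed (distro-packaged) librocksdb
    ./autogen.sh && ./configure --with-librocksdb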
--
Best Regards,

Wheat
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel"
in the body of a message to [email protected]
More majordomo info at http://vger.kernel.org/majordomo-info.html
