Hi Sushma,

Thanks for your investigation! We already noticed the serialization risk in GenericObjectMap/DBObjectMap; to improve performance, we added a header cache to DBObjectMap.

As for KeyValueStore, a cache branch is under review that can greatly reduce the number of lookup_header calls. Of course, replacing the lock with a RWLock is a good suggestion; I would like to try it and measure the effect!
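To illustrate the idea, here is a purely hypothetical C++ sketch, not the actual GenericObjectMap/DBObjectMap code: a small header cache sitting in front of the key-value lookup, guarded by a reader/writer lock so that concurrent reads do not serialize on a single mutex. The Header and HeaderCache names are made up, and std::shared_mutex is used only to keep the sketch self-contained (Ceph has its own RWLock wrapper).

// Hypothetical sketch only: a size-bounded header cache in front of a
// key-value lookup, guarded by a shared (reader/writer) mutex so that
// concurrent readers do not serialize. Not Ceph's actual implementation.
#include <cstddef>
#include <list>
#include <memory>
#include <shared_mutex>
#include <string>
#include <unordered_map>
#include <utility>

struct Header;  // stands in for the decoded object header

class HeaderCache {
 public:
  explicit HeaderCache(std::size_t max_size) : max_size_(max_size) {}

  // Fast path: many readers may probe the cache concurrently.
  std::shared_ptr<Header> lookup(const std::string &oid) {
    std::shared_lock<std::shared_mutex> rl(lock_);
    auto it = map_.find(oid);
    return it == map_.end() ? nullptr : it->second.second;
  }

  // Slow path: take the exclusive lock only when inserting a header
  // that was just loaded from the key-value store.
  void insert(const std::string &oid, std::shared_ptr<Header> h) {
    std::unique_lock<std::shared_mutex> wl(lock_);
    if (map_.count(oid))
      return;                        // already cached by a racing reader
    lru_.push_front(oid);
    map_[oid] = {lru_.begin(), std::move(h)};
    if (map_.size() > max_size_) {   // evict the oldest inserted entry
      map_.erase(lru_.back());
      lru_.pop_back();
    }
  }

 private:
  std::size_t max_size_;
  std::shared_mutex lock_;
  std::list<std::string> lru_;
  std::unordered_map<std::string,
                     std::pair<std::list<std::string>::iterator,
                               std::shared_ptr<Header>>> map_;
};

A read path such as lookup_header() would then probe the cache first and only fall back to the key-value store (followed by insert()) on a miss; anything that modifies a header would have to invalidate or update the cached entry.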
On Tue, Jul 1, 2014 at 8:39 AM, Sushma Gurram <[email protected]> wrote:
> Hi Haomai/Greg,
>
> I tried to analyze this a bit more, and it appears that the GenericObjectMap::header_lock is serializing the READ requests in the following path, hence the low performance numbers with KeyValueStore:
> ReplicatedPG::do_op() -> ReplicatedPG::find_object_context() -> ReplicatedPG::get_object_context() -> PGBackend::objects_get_attr() -> KeyValueStore::getattr() -> GenericObjectMap::get_values() -> GenericObjectMap::lookup_header()
>
> I modified the code to avoid this lock for a specific run and noticed that the performance is similar to FileStore.
>
> In our earlier investigations we also noticed similar serialization issues with DBObjectMap::header_lock when rgw stores xattrs in LevelDB.
>
> Can you please help us understand the reason for this lock, and whether it can be replaced with a RWLock? Any other suggestions to avoid serialization due to this lock are also welcome.
>
> Thanks,
> Sushma
>
> -----Original Message-----
> From: Haomai Wang [mailto:[email protected]]
> Sent: Friday, June 27, 2014 1:08 AM
> To: Sushma Gurram
> Cc: Shu, Xinxin; Mark Nelson; Sage Weil; Zhang, Jian; [email protected]
> Subject: Re: [RFC] add rocksdb support
>
> As I mentioned days ago, there are two points related to kvstore performance:
> 1. The order of the image and the strip size are important to performance. Because a header, like an inode in a filesystem, is much more lightweight than an fd, the image order is expected to be lower. And the strip size can be configured to 4 KB to improve large-IO performance.
> 2. The header cache (https://github.com/ceph/ceph/pull/1649) is not merged yet; the header cache is important to performance. It is just like the fdcache in FileStore.
>
> As for the detailed performance numbers, I think this result based on the master branch is roughly correct. When the strip size and the header cache are ready, I think it will be better.
>
> On Fri, Jun 27, 2014 at 8:44 AM, Sushma Gurram <[email protected]> wrote:
>> Delivery failure due to table format. Resending as plain text.
>>
>> _____________________________________________
>> From: Sushma Gurram
>> Sent: Thursday, June 26, 2014 5:35 PM
>> To: 'Shu, Xinxin'; 'Mark Nelson'; 'Sage Weil'
>> Cc: 'Zhang, Jian'; [email protected]
>> Subject: RE: [RFC] add rocksdb support
>>
>> Hi Xinxin,
>>
>> Thanks for providing the results of the performance tests.
>>
>> I used fio (with support for the rbd ioengine) to compare XFS and RocksDB with a single OSD, and also confirmed with rados bench that both sets of numbers are of the same order.
>> My findings show that XFS is better than rocksdb. Can you please let us know the rocksdb configuration you used, the object size, and the duration of the rados bench run?
>> For the random write tests, I see the "rocksdb:bg0" thread as the top CPU consumer (this thread is at about 50% CPU, while every other thread in the OSD is below 10% utilization).
>> Is there a ceph.conf config option to configure the background threads in rocksdb?
>>
>> We ran our tests with the following configuration:
>> System: Intel(R) Xeon(R) CPU E5-4620 0 @ 2.20GHz (16 physical cores), HT disabled, 16 GB memory
>>
>> The rocksdb configuration has been set to the following values in ceph.conf:
>> rocksdb_write_buffer_size = 4194304
>> rocksdb_cache_size = 4194304
>> rocksdb_bloom_size = 0
>> rocksdb_max_open_files = 10240
>> rocksdb_compression = false
>> rocksdb_paranoid = false
>> rocksdb_log = /dev/null
>> rocksdb_compact_on_mount = false
>>
>> fio used the rbd ioengine with numjobs=1 for writes, numjobs=16 for reads, and iodepth=32. Unlike rados bench, fio with the rbd ioengine creates multiple (= numjobs) client connections to the OSD, thus stressing the OSD.
>>
>> rbd image size = 2 GB, rocksdb_write_buffer_size = 4 MB
>> -------------------------------------------------------
>> IO Pattern      XFS (IOPS)    RocksDB (IOPS)
>> 4K writes       ~1450         ~670
>> 4K reads        ~65000        ~2000
>> 64K writes      ~431          ~57
>> 64K reads       ~17500        ~180
>>
>> rbd image size = 2 GB, rocksdb_write_buffer_size = 1 GB
>> -------------------------------------------------------
>> IO Pattern      XFS (IOPS)    RocksDB (IOPS)
>> 4K writes       ~1450         ~962
>> 4K reads        ~65000        ~1641
>> 64K writes      ~431          ~426
>> 64K reads       ~17500        ~209
>>
>> I guess the lower rocksdb performance can theoretically be attributed to compaction during writes and merging during reads, but I'm not sure READs should be lower by this magnitude.
>> However, your results seem to show otherwise. Can you please help us with the rocksdb config and how rados bench was run?
>>
>> Thanks,
>> Sushma
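For reference, here is a rough, standalone sketch of how settings like the ones above map onto the upstream RocksDB C++ API when opening a database. This is plain RocksDB usage, not Ceph's RocksDBStore wiring; the correspondence to the ceph.conf keys (including whether the background threads behind "rocksdb:bg0" are tunable from ceph.conf) is an assumption rather than something shown in this thread.

// Rough, standalone RocksDB example (upstream C++ API, not Ceph's
// RocksDBStore): roughly what the rocksdb_* settings above correspond to.
#include <cassert>
#include <rocksdb/cache.h>
#include <rocksdb/db.h>
#include <rocksdb/filter_policy.h>
#include <rocksdb/options.h>
#include <rocksdb/table.h>

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  options.write_buffer_size = 4 * 1024 * 1024;    // cf. rocksdb_write_buffer_size = 4194304
  options.max_open_files = 10240;                 // cf. rocksdb_max_open_files = 10240
  options.compression = rocksdb::kNoCompression;  // cf. rocksdb_compression = false

  // Block cache (cf. rocksdb_cache_size); a bloom filter could be enabled via
  // NewBloomFilterPolicy(), which rocksdb_bloom_size = 0 presumably disables.
  rocksdb::BlockBasedTableOptions table_options;
  table_options.block_cache = rocksdb::NewLRUCache(4 * 1024 * 1024);
  // table_options.filter_policy.reset(rocksdb::NewBloomFilterPolicy(10));
  options.table_factory.reset(
      rocksdb::NewBlockBasedTableFactory(table_options));

  // Background compaction/flush threads ("rocksdb:bg*") are controlled by
  // options such as these in the RocksDB API itself.
  options.max_background_compactions = 2;
  options.max_background_flushes = 1;

  rocksdb::DB *db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/rocksdb-example", &db);
  assert(s.ok());
  delete db;
  return 0;
}

Note that write_buffer_size here is the same knob varied between the two result tables above (4 MB vs. 1 GB).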
>> -----Original Message-----
>> From: Shu, Xinxin [mailto:[email protected]]
>> Sent: Sunday, June 22, 2014 6:18 PM
>> To: Sushma Gurram; 'Mark Nelson'; 'Sage Weil'
>> Cc: '[email protected]'; Zhang, Jian
>> Subject: RE: [RFC] add rocksdb support
>>
>> Hi all,
>>
>> We enabled rocksdb as the data store in our test setup (10 OSDs on two servers; each server has 5 HDDs as OSDs, 2 SSDs as journals, Intel(R) Xeon(R) CPU E31280) and ran performance tests for xfs, leveldb and rocksdb, using rados bench as our test tool. The following chart shows the details. For writes, with a small number of threads, leveldb performance is lower than the other two backends; from 16 threads on, rocksdb performs a little better than xfs and leveldb, and both leveldb and rocksdb perform much better than xfs at higher thread counts.
>>
>>                     xfs                   leveldb               rocksdb
>>                     throughput  latency   throughput  latency   throughput  latency
>> 1 thread write      84.029      0.048     52.430      0.076     71.920      0.056
>> 2 threads write     166.417     0.048     97.917      0.082     155.148     0.052
>> 4 threads write     304.099     0.052     156.094     0.102     270.461     0.059
>> 8 threads write     323.047     0.099     221.370     0.144     339.455     0.094
>> 16 threads write    295.040     0.216     272.032     0.235     348.849     0.183
>> 32 threads write    324.467     0.394     290.072     0.441     338.103     0.378
>> 64 threads write    313.713     0.812     293.261     0.871     324.603     0.787
>> 1 thread read       75.687      0.053     71.629      0.056     72.526      0.055
>> 2 threads read      182.329     0.044     151.683     0.053     153.125     0.052
>> 4 threads read      320.785     0.050     307.180     0.052     312.016     0.051
>> 8 threads read      504.880     0.063     512.295     0.062     519.683     0.062
>> 16 threads read     477.706     0.134     643.385     0.099     654.149     0.098
>> 32 threads read     517.670     0.247     666.696     0.192     678.480     0.189
>> 64 threads read     516.599     0.495     668.360     0.383     680.673     0.376
>>
>> -----Original Message-----
>> From: Shu, Xinxin
>> Sent: Saturday, June 14, 2014 11:50 AM
>> To: Sushma Gurram; Mark Nelson; Sage Weil
>> Cc: [email protected]; Zhang, Jian
>> Subject: RE: [RFC] add rocksdb support
>>
>> Currently ceph gets a stable rocksdb from the 3.0.fb branch of ceph/rocksdb. Since PR https://github.com/ceph/rocksdb/pull/2 has not been merged, the rocksdb you get with 'git submodule update --init' does not support autoconf/automake yet.
>>
>> -----Original Message-----
>> From: [email protected] [mailto:[email protected]] On Behalf Of Sushma Gurram
>> Sent: Saturday, June 14, 2014 2:52 AM
>> To: Shu, Xinxin; Mark Nelson; Sage Weil
>> Cc: [email protected]; Zhang, Jian
>> Subject: RE: [RFC] add rocksdb support
>>
>> Hi Xinxin,
>>
>> I tried to compile the wip-rocksdb branch, but the src/rocksdb directory seems to be empty. Do I need to put autoconf/automake in this directory?
>> It doesn't seem to have any other source files, and compilation fails:
>> os/RocksDBStore.cc:10:24: fatal error: rocksdb/db.h: No such file or directory
>> compilation terminated.
>>
>> Thanks,
>> Sushma
>>
>> -----Original Message-----
>> From: [email protected] [mailto:[email protected]] On Behalf Of Shu, Xinxin
>> Sent: Monday, June 09, 2014 10:00 PM
>> To: Mark Nelson; Sage Weil
>> Cc: [email protected]; Zhang, Jian
>> Subject: RE: [RFC] add rocksdb support
>>
>> Hi Mark,
>>
>> I have finished development of the rocksdb submodule support. A pull request adding autoconf/automake support to rocksdb has been created; you can find it at https://github.com/ceph/rocksdb/pull/2. If that patch is OK, I will create a pull request for the rocksdb submodule support; currently the patch can be found at https://github.com/xinxinsh/ceph/tree/wip-rocksdb.
>>
>> -----Original Message-----
>> From: [email protected] [mailto:[email protected]] On Behalf Of Mark Nelson
>> Sent: Tuesday, June 10, 2014 1:12 AM
>> To: Shu, Xinxin; Sage Weil
>> Cc: [email protected]; Zhang, Jian
>> Subject: Re: [RFC] add rocksdb support
>>
>> Hi Xinxin,
>>
>> On 05/28/2014 05:05 AM, Shu, Xinxin wrote:
>>> Hi Sage,
>>> I will add two configure options, --with-librocksdb-static and --with-librocksdb. With --with-librocksdb-static, ceph will compile the rocksdb code it gets from the ceph repository; with --with-librocksdb, in the case of distro packages for rocksdb, ceph will not compile the rocksdb code and will use the pre-installed library. Is that OK for you?
>>> Since current rocksdb does not support autoconf/automake, I will add autoconf/automake support to rocksdb, but before that, I think we should fork a stable branch (maybe 3.0) for ceph.
>>
>> I'm looking at testing out the rocksdb support as well, both for the OSD and for the monitor, based on some issues we've been seeing lately. Any news on the 3.0 fork and autoconf/automake support in rocksdb?
>>
>> Thanks,
>> Mark
>>
>>>
>>> -----Original Message-----
>>> From: Mark Nelson [mailto:[email protected]]
>>> Sent: Wednesday, May 21, 2014 9:06 PM
>>> To: Shu, Xinxin; Sage Weil
>>> Cc: [email protected]; Zhang, Jian
>>> Subject: Re: [RFC] add rocksdb support
>>>
>>> On 05/21/2014 07:54 AM, Shu, Xinxin wrote:
>>>> Hi Sage,
>>>>
>>>> I will add the rocksdb submodule to the makefile. Currently we want to run full performance tests on the key-value db backends, both leveldb and rocksdb, and then optimize rocksdb performance.
>>>
>>> I'm definitely interested in any performance tests you do here. Last winter I started doing some fairly high-level tests on raw leveldb/hyperleveldb/raikleveldb. I'm very interested in what you see with rocksdb as a backend.
>>>
>>>>
>>>> -----Original Message-----
>>>> From: Sage Weil [mailto:[email protected]]
>>>> Sent: Wednesday, May 21, 2014 9:19 AM
>>>> To: Shu, Xinxin
>>>> Cc: [email protected]
>>>> Subject: Re: [RFC] add rocksdb support
>>>>
>>>> Hi Xinxin,
>>>>
>>>> I've pushed an updated wip-rocksdb to github/liewegas/ceph.git that includes the latest set of patches with the groundwork and your rocksdb patch. There is also a commit that adds rocksdb as a git submodule. I'm thinking that, since there aren't any distro packages for rocksdb at this point, this is going to be the easiest way to make this usable for people.
>>>>
>>>> If you can wire the submodule into the makefile, we can merge this in so that rocksdb support is in the ceph.com packages. I suspect that the distros will prefer to turn this off in favor of separate shared libs, but they can do that at their option if/when they include rocksdb in the distro. I think the key is just to have both --with-librocksdb and --with-librocksdb-static (or similar) options so that you can use either the static or the dynamically linked one.
>>>>
>>>> Has your group done further testing with rocksdb? Anything interesting to share?
>>>>
>>>> Thanks!
>>>> sage
>
> --
> Best Regards,
>
> Wheat

--
Best Regards,

Wheat
