On Tue, 1 Jul 2014, Somnath Roy wrote:
> Hi Haomai,
> But the cache hit rate will be minimal, or zero, if the actual storage per
> node is very large (say at the PB level). So it will mostly be hitting Omap,
> won't it?
> How is this header cache going to resolve the serialization issue then?
The header cache is really important for transactions that have multiple ops
on the same object. But I suspect you're right that for some workloads it
won't help with the lock contention you are seeing here.

sage

> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of Haomai Wang
> Sent: Monday, June 30, 2014 11:10 PM
> To: Sushma Gurram
> Cc: Shu, Xinxin; Mark Nelson; Sage Weil; Zhang, Jian; [email protected]
> Subject: Re: [RFC] add rocksdb support
>
> Hi Sushma,
>
> Thanks for your investigations! We already noticed the serialization risk in
> GenericObjectMap/DBObjectMap. To improve performance we added a header cache
> to DBObjectMap.
>
> As for KeyValueStore, a cache branch is under review; it can greatly reduce
> lookup_header calls. Of course, replacing the lock with an RWLock is a good
> suggestion, and I would like to try it and measure the effect!
>
> On Tue, Jul 1, 2014 at 8:39 AM, Sushma Gurram <[email protected]> wrote:
> > Hi Haomai/Greg,
> >
> > I tried to analyze this a bit more, and it appears that
> > GenericObjectMap::header_lock serializes the READ requests in the
> > following path, hence the low performance numbers with KeyValueStore:
> >
> > ReplicatedPG::do_op() -> ReplicatedPG::find_object_context() ->
> > ReplicatedPG::get_object_context() -> PGBackend::objects_get_attr() ->
> > KeyValueStore::getattr() -> GenericObjectMap::get_values() ->
> > GenericObjectMap::lookup_header()
> >
> > I modified the code to bypass this lock for one specific run and noticed
> > that the performance is then similar to FileStore.
> >
> > In our earlier investigations we noticed similar serialization issues
> > with DBObjectMap::header_lock when rgw stores xattrs in LevelDB.
> >
> > Can you please help us understand the reason for this lock, and whether it
> > can be replaced with an RWLock, or suggest any other way to avoid the
> > serialization it causes?
> >
> > Thanks,
> > Sushma
> >
> > -----Original Message-----
> > From: Haomai Wang [mailto:[email protected]]
> > Sent: Friday, June 27, 2014 1:08 AM
> > To: Sushma Gurram
> > Cc: Shu, Xinxin; Mark Nelson; Sage Weil; Zhang, Jian; [email protected]
> > Subject: Re: [RFC] add rocksdb support
> >
> > As I mentioned a few days ago, there are two points related to kvstore
> > performance:
> >
> > 1. The order of the image and the strip size are important to performance.
> > Because a header, like an inode in a filesystem, is much more lightweight
> > than an fd, the image order is expected to be lower, and the strip size can
> > be configured to 4 KB to improve large-IO performance.
> > 2. The header cache (https://github.com/ceph/ceph/pull/1649) is not merged
> > yet, and the header cache is important to performance. It is much like the
> > fdcache in FileStore.
> >
> > As for the detailed performance numbers, I think this result based on the
> > master branch is roughly correct. When strip size and the header cache are
> > ready, I think it will be better.
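The RWLock suggestion above amounts to letting concurrent readers pass through
lookup_header() while header creation and removal still take the lock
exclusively. Below is a minimal sketch of that locking pattern only; the types
and the std::shared_timed_mutex are stand-ins for illustration, not the actual
GenericObjectMap code.

#include <cstdint>
#include <map>
#include <mutex>
#include <shared_mutex>
#include <string>

// Hypothetical stand-ins for the real header types; only the locking
// pattern is the point here.
struct Header { uint64_t seq = 0; };

class HeaderMapSketch {
  std::shared_timed_mutex header_lock;    // instead of an exclusive Mutex
  std::map<std::string, Header> headers;  // stand-in for the real header index

public:
  // Read path (e.g. behind get_values()/getattr()): many readers may hold
  // the lock at once, so lookups no longer serialize each other.
  bool lookup_header(const std::string& oid, Header* out) {
    std::shared_lock<std::shared_timed_mutex> l(header_lock);
    auto it = headers.find(oid);
    if (it == headers.end())
      return false;
    *out = it->second;
    return true;
  }

  // Write path (header creation/update/removal) still takes the lock
  // exclusively, so mutations remain serialized.
  void set_header(const std::string& oid, const Header& h) {
    std::unique_lock<std::shared_timed_mutex> l(header_lock);
    headers[oid] = h;
  }
};

Whether this helps in practice depends on how often the write path is taken;
the header cache branch mentioned above attacks the same contention from a
different angle, by avoiding repeated lookup_header() calls altogether.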
> > On Fri, Jun 27, 2014 at 8:44 AM, Sushma Gurram <[email protected]> wrote:
> >> Delivery failure due to table format. Resending as plain text.
> >>
> >> _____________________________________________
> >> From: Sushma Gurram
> >> Sent: Thursday, June 26, 2014 5:35 PM
> >> To: 'Shu, Xinxin'; 'Mark Nelson'; 'Sage Weil'
> >> Cc: 'Zhang, Jian'; [email protected]
> >> Subject: RE: [RFC] add rocksdb support
> >>
> >> Hi Xinxin,
> >>
> >> Thanks for providing the results of the performance tests.
> >>
> >> I used fio (with support for the rbd ioengine) to compare XFS and RocksDB
> >> with a single OSD, and confirmed with rados bench that both sets of
> >> numbers are of the same order.
> >> My findings show that XFS is better than RocksDB. Can you please let us
> >> know the RocksDB configuration that you used, and the object size and
> >> duration of the run for rados bench?
> >> For the random write tests, I see the "rocksdb:bg0" thread as the top CPU
> >> consumer (this thread is at about 50% CPU, while every other thread in
> >> the OSD is below 10%).
> >> Is there a ceph.conf option to configure the number of background threads
> >> in RocksDB?
> >>
> >> We ran our tests with the following configuration:
> >> System: Intel(R) Xeon(R) CPU E5-4620 0 @ 2.20GHz (16 physical cores),
> >> HT disabled, 16 GB memory
> >>
> >> The RocksDB options were set to the following values in ceph.conf:
> >> rocksdb_write_buffer_size = 4194304
> >> rocksdb_cache_size = 4194304
> >> rocksdb_bloom_size = 0
> >> rocksdb_max_open_files = 10240
> >> rocksdb_compression = false
> >> rocksdb_paranoid = false
> >> rocksdb_log = /dev/null
> >> rocksdb_compact_on_mount = false
> >>
> >> fio used the rbd ioengine with numjobs=1 for writes, numjobs=16 for reads,
> >> and iodepth=32. Unlike rados bench, fio's rbd engine creates multiple
> >> (=numjobs) client connections to the OSD, thus stressing the OSD.
> >>
> >> rbd image size = 2 GB, rocksdb_write_buffer_size = 4 MB
> >> -------------------------------------------------------
> >> IO Pattern     XFS (IOPS)    RocksDB (IOPS)
> >> 4K writes      ~1450         ~670
> >> 4K reads       ~65000        ~2000
> >> 64K writes     ~431          ~57
> >> 64K reads      ~17500        ~180
> >>
> >> rbd image size = 2 GB, rocksdb_write_buffer_size = 1 GB
> >> -------------------------------------------------------
> >> IO Pattern     XFS (IOPS)    RocksDB (IOPS)
> >> 4K writes      ~1450         ~962
> >> 4K reads       ~65000        ~1641
> >> 64K writes     ~431          ~426
> >> 64K reads      ~17500        ~209
> >>
> >> I guess the lower RocksDB performance can in theory be attributed to
> >> compaction during writes and merging during reads, but I'm not sure that
> >> READs should be lower by this magnitude.
> >> However, your results seem to show otherwise. Can you please share your
> >> RocksDB config and how rados bench was run?
> >>
> >> Thanks,
> >> Sushma
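On the background-thread question above: RocksDB's native C++ Options struct
exposes the write buffer, compression, and the compaction/flush thread pools
directly, so the ceph.conf keys listed above have fairly direct counterparts
there. A rough sketch of that mapping follows; the correspondence shown is an
assumption for illustration, not necessarily how Ceph's RocksDBStore actually
translates these settings.

#include <rocksdb/db.h>
#include <rocksdb/env.h>
#include <rocksdb/options.h>

// Build rocksdb::Options roughly matching the ceph.conf values quoted above.
rocksdb::Options make_options()
{
  rocksdb::Options opts;
  opts.create_if_missing = true;
  opts.write_buffer_size = 4 * 1024 * 1024;    // rocksdb_write_buffer_size = 4194304
  opts.max_open_files = 10240;                 // rocksdb_max_open_files = 10240
  opts.compression = rocksdb::kNoCompression;  // rocksdb_compression = false
  opts.paranoid_checks = false;                // rocksdb_paranoid = false
  // (rocksdb_cache_size would map onto a block cache; omitted here.)

  // The "rocksdb:bg0" thread is presumably one of the background workers;
  // the pool sizes are controlled here rather than by the keys listed above.
  opts.max_background_compactions = 2;
  opts.max_background_flushes = 1;
  opts.env->SetBackgroundThreads(2, rocksdb::Env::LOW);   // compaction pool
  opts.env->SetBackgroundThreads(1, rocksdb::Env::HIGH);  // flush pool

  return opts;
}

Opening a DB with options like these and watching the background threads under
the same fio load would show whether compaction is in fact the bottleneck in
the random-write case, or whether the small 4 MB write buffer is.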
> >> -----Original Message-----
> >> From: Shu, Xinxin [mailto:[email protected]]
> >> Sent: Sunday, June 22, 2014 6:18 PM
> >> To: Sushma Gurram; 'Mark Nelson'; 'Sage Weil'
> >> Cc: '[email protected]'; Zhang, Jian
> >> Subject: RE: [RFC] add rocksdb support
> >>
> >> Hi all,
> >>
> >> We enabled rocksdb as the data store in our test setup (10 OSDs on two
> >> servers, each server with 5 HDDs as OSDs and 2 SSDs as journals, Intel(R)
> >> Xeon(R) CPU E31280) and ran performance tests for xfs, leveldb and rocksdb
> >> (using rados bench as the test tool). The table below shows the details.
> >> For writes with a small number of threads, leveldb performance is lower
> >> than the other two backends; from 16 threads onwards, rocksdb performs a
> >> little better than xfs and leveldb, and both leveldb and rocksdb perform
> >> much better than xfs at higher thread counts.
> >>
> >>                      xfs                  leveldb              rocksdb
> >>                      throughput latency   throughput latency   throughput latency
> >> 1 thread write       84.029     0.048     52.430     0.076     71.920     0.056
> >> 2 threads write      166.417    0.048     97.917     0.082     155.148    0.052
> >> 4 threads write      304.099    0.052     156.094    0.102     270.461    0.059
> >> 8 threads write      323.047    0.099     221.370    0.144     339.455    0.094
> >> 16 threads write     295.040    0.216     272.032    0.235     348.849    0.183
> >> 32 threads write     324.467    0.394     290.072    0.441     338.103    0.378
> >> 64 threads write     313.713    0.812     293.261    0.871     324.603    0.787
> >> 1 thread read        75.687     0.053     71.629     0.056     72.526     0.055
> >> 2 threads read       182.329    0.044     151.683    0.053     153.125    0.052
> >> 4 threads read       320.785    0.050     307.180    0.052     312.016    0.051
> >> 8 threads read       504.880    0.063     512.295    0.062     519.683    0.062
> >> 16 threads read      477.706    0.134     643.385    0.099     654.149    0.098
> >> 32 threads read      517.670    0.247     666.696    0.192     678.480    0.189
> >> 64 threads read      516.599    0.495     668.360    0.383     680.673    0.376
> >>
> >> -----Original Message-----
> >> From: Shu, Xinxin
> >> Sent: Saturday, June 14, 2014 11:50 AM
> >> To: Sushma Gurram; Mark Nelson; Sage Weil
> >> Cc: [email protected]; Zhang, Jian
> >> Subject: RE: [RFC] add rocksdb support
> >>
> >> Currently ceph gets the stable rocksdb from branch 3.0.fb of ceph/rocksdb.
> >> Since PR https://github.com/ceph/rocksdb/pull/2 has not been merged yet,
> >> if you use 'git submodule update --init' to get the rocksdb submodule, it
> >> does not yet support autoconf/automake.
> >>
> >> -----Original Message-----
> >> From: [email protected] [mailto:[email protected]] On Behalf Of Sushma Gurram
> >> Sent: Saturday, June 14, 2014 2:52 AM
> >> To: Shu, Xinxin; Mark Nelson; Sage Weil
> >> Cc: [email protected]; Zhang, Jian
> >> Subject: RE: [RFC] add rocksdb support
> >>
> >> Hi Xinxin,
> >>
> >> I tried to compile the wip-rocksdb branch, but the src/rocksdb directory
> >> seems to be empty. Do I need to put the autoconf/automake files in this
> >> directory? It doesn't seem to have any other source files, and compilation
> >> fails:
> >>
> >> os/RocksDBStore.cc:10:24: fatal error: rocksdb/db.h: No such file or directory
> >> compilation terminated.
> >>
> >> Thanks,
> >> Sushma
> >>
> >> -----Original Message-----
> >> From: [email protected] [mailto:[email protected]] On Behalf Of Shu, Xinxin
> >> Sent: Monday, June 09, 2014 10:00 PM
> >> To: Mark Nelson; Sage Weil
> >> Cc: [email protected]; Zhang, Jian
> >> Subject: RE: [RFC] add rocksdb support
> >>
> >> Hi Mark,
> >>
> >> I have finished development of the rocksdb submodule support. A pull
> >> request adding autoconf/automake support to rocksdb has been created; you
> >> can find it at https://github.com/ceph/rocksdb/pull/2. If that patch is
> >> OK, I will create a pull request for the rocksdb submodule support;
> >> currently the patch can be found at
> >> https://github.com/xinxinsh/ceph/tree/wip-rocksdb.
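Regarding the earlier "rocksdb/db.h: No such file or directory" failure: once
the submodule is actually populated and built, a tiny standalone program
against the library is a quick way to separate Ceph build-system problems from
a missing or broken rocksdb checkout. A minimal sketch follows; the include
and link paths in the comment are assumptions that depend on where the
submodule was built.

// Build roughly like (paths are assumptions for a tree-local submodule build):
//   g++ -std=c++11 rocksdb_smoke.cc -Isrc/rocksdb/include -Lsrc/rocksdb \
//       -lrocksdb -lpthread -lz -o rocksdb_smoke
#include <cassert>
#include <iostream>
#include <string>

#include <rocksdb/db.h>

int main()
{
  rocksdb::Options options;
  options.create_if_missing = true;

  // Open (or create) a throwaway database.
  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/rocksdb-smoke", &db);
  if (!s.ok()) {
    std::cerr << "open failed: " << s.ToString() << std::endl;
    return 1;
  }

  // Round-trip one key/value pair to confirm the library actually works.
  s = db->Put(rocksdb::WriteOptions(), "key", "value");
  assert(s.ok());

  std::string value;
  s = db->Get(rocksdb::ReadOptions(), "key", &value);
  assert(s.ok() && value == "value");

  std::cout << "rocksdb looks usable: key=" << value << std::endl;
  delete db;
  return 0;
}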
> >> -----Original Message-----
> >> From: [email protected] [mailto:[email protected]] On Behalf Of Mark Nelson
> >> Sent: Tuesday, June 10, 2014 1:12 AM
> >> To: Shu, Xinxin; Sage Weil
> >> Cc: [email protected]; Zhang, Jian
> >> Subject: Re: [RFC] add rocksdb support
> >>
> >> Hi Xinxin,
> >>
> >> On 05/28/2014 05:05 AM, Shu, Xinxin wrote:
> >>> Hi Sage,
> >>>
> >>> I will add two configure options, --with-librocksdb-static and
> >>> --with-librocksdb. With the --with-librocksdb-static option, ceph will
> >>> compile the rocksdb code that it gets from the ceph repository; with the
> >>> --with-librocksdb option, for the case of distro packages for rocksdb,
> >>> ceph will not compile the rocksdb code and will use the pre-installed
> >>> library. Is that OK for you?
> >>>
> >>> Since current rocksdb does not support autoconf/automake, I will add
> >>> autoconf/automake support for rocksdb, but before that I think we should
> >>> fork a stable branch (maybe 3.0) for ceph.
> >>
> >> I'm looking at testing out the rocksdb support as well, both for the OSD
> >> and for the monitor, based on some issues we've been seeing lately. Any
> >> news on the 3.0 fork and autoconf/automake support in rocksdb?
> >>
> >> Thanks,
> >> Mark
> >>
> >>> -----Original Message-----
> >>> From: Mark Nelson [mailto:[email protected]]
> >>> Sent: Wednesday, May 21, 2014 9:06 PM
> >>> To: Shu, Xinxin; Sage Weil
> >>> Cc: [email protected]; Zhang, Jian
> >>> Subject: Re: [RFC] add rocksdb support
> >>>
> >>> On 05/21/2014 07:54 AM, Shu, Xinxin wrote:
> >>>> Hi Sage,
> >>>>
> >>>> I will add the rocksdb submodule to the makefile. Currently we want to
> >>>> run full performance tests on the key-value db backends, both leveldb
> >>>> and rocksdb, and then optimize rocksdb performance.
> >>>
> >>> I'm definitely interested in any performance tests you do here. Last
> >>> winter I started doing some fairly high-level tests on raw
> >>> leveldb/hyperleveldb/riak leveldb. I'm very interested in what you see
> >>> with rocksdb as a backend.
> >>>
> >>>> -----Original Message-----
> >>>> From: Sage Weil [mailto:[email protected]]
> >>>> Sent: Wednesday, May 21, 2014 9:19 AM
> >>>> To: Shu, Xinxin
> >>>> Cc: [email protected]
> >>>> Subject: Re: [RFC] add rocksdb support
> >>>>
> >>>> Hi Xinxin,
> >>>>
> >>>> I've pushed an updated wip-rocksdb to github/liewegas/ceph.git that
> >>>> includes the latest set of patches with the groundwork and your rocksdb
> >>>> patch. There is also a commit that adds rocksdb as a git submodule.
> >>>> I'm thinking that, since there aren't any distro packages for rocksdb
> >>>> at this point, this is going to be the easiest way to make this usable
> >>>> for people.
> >>>>
> >>>> If you can wire the submodule into the makefile, we can merge this in
> >>>> so that rocksdb support is in the ceph.com packages. I suspect that the
> >>>> distros will prefer to turn this off in favor of separate shared libs,
> >>>> but they can do that at their option if/when they include rocksdb in
> >>>> the distro. I think the key is just to have both --with-librocksdb and
> >>>> --with-librocksdb-static (or similar) options so that you can use
> >>>> either the statically or dynamically linked one.
> >>>>
> >>>> Has your group done further testing with rocksdb? Anything interesting
> >>>> to share?
> >>>>
> >>>> Thanks!
> >>>> sage
> >>>
> >>
> >
> > --
> > Best Regards,
> >
> > Wheat
>
> --
> Best Regards,
>
> Wheat
