For the overwrite case, I checked the height of the B-tree of my LMDB database. The height is 4, so a "one byte" update should touch 4 pages plus one meta page, which gives a write amplification of 5 rather than ~9. Let me know if I missed something.
By the way, how can I get the degree of the B-tree of an LMDB database?

Cheers,
xinxin

-----Original Message-----
From: Леонид Юрьев [mailto:[email protected]]
Sent: Tuesday, May 05, 2015 6:16 PM
To: Shu, Xinxin
Cc: [email protected]
Subject: Re: large write amplification

Hm, ANY change needs a btree update.

Let us have an item key=K, data=A. Then overwrite A with B, so now key=K, data=B.
This is a simple "one byte" change, but a few db-pages need to be cloned and updated:
 - the page which contains data=B and the records around it.
 - the page in the b-tree which holds a pointer/reference to the page which contains data=B and the records around it.
 - all pages on the leaf-to-root path in the b-tree, related to the new page in the b-tree which holds a pointer/reference to the page which contains data=B and the records around it.
 - ...
 - the new root pages of mainDB and freeDB.
 - the pointer to the "new root" in the meta-page, that lay in the house that Jack built ;)

So, by design LMDB is optimized for high-load reading, but not for writes.

Leonid.

2015-05-05 10:26 GMT+03:00 Shu, Xinxin <[email protected]>:
> Hi Leonid,
>
> Thanks for your reply. I observed another scenario: I also tested
> "overwrite mode". I slightly modified the source code to change the default
> behavior (set dbflags_ = SYNC, to flush data to disk once a transaction is
> committed) and also collected iostat data. The overwrite IOPS is ~521
> ops/sec, but iostat shows that w/s is ~4666, so the write amplification
> is ~9. To my understanding, overwriting an existing value does not adjust
> the btree, so why is the write amplification so large? Could you help explain?
> Thanks
>
> Cheers,
> xinxin
>
> -----Original Message-----
> From: Леонид Юрьев [mailto:[email protected]]
> Sent: Monday, May 04, 2015 6:59 PM
> To: Shu, Xinxin
> Cc: [email protected]
> Subject: Re: large write amplification
>
> Hi, Xinxin.
>
> I will try to answer briefly, without details:
>
> - To allow readers to never be blocked by a writer, LMDB provides a snapshot of
> data, indexes and directory for each completed transaction.
>
> - Most of the db-pages (those not changed by a particular
> transaction) are "shared" between such snapshots. But any change to the data
> itself, and its reflection in the btree indexes (including a particular table, free-db,
> main-db and so forth), requires new pages to be used and written to the disk.
>
> - In a large db a small "one-byte" change may make a lot of db-pages "dirty"
> (usually 4K each). For example, one add/del/mod operation in an LDAP db with a size
> of a few GB requires about 50-100 page-level IOPS.
>
> Leonid.
>
> P.S.
> For high-load use cases I made a few changes in our fork of OpenLDAP/LMDB.
> One of these features we called "LIFO reclaiming".
> It gives us a 10-50 times performance boost, especially by engaging the benefits of
> the write-back cache of the storage subsystem.
> Nowadays we use it in our production (telco) environment.
> But currently it is not safe for all cases, see
> https://github.com/ReOpen/ReOpenLDAP/issues/2 and
> https://github.com/ReOpen/ReOpenLDAP/issues/1.
>
> 2015-05-04 5:31 GMT+03:00 Shu, Xinxin <[email protected]>:
>> Hi list,
>>
>> Recently I ran micro tests of LMDB on a DC3700 (200GB), using the bench
>> code at https://github.com/hyc/leveldb/tree/benches. I tested fillrandsync
>> mode and collected iostat data, and found that the write amplification is large.
>> For the fillrandsync case:
>>
>> IOPS: 1020 ops/sec
>>
>> Iostat data shows that w/s on that SSD is 8093, avgqu-sz is ~1, and
>> await time is about 0.16 ms, so the write amplification is ~8, which
>> seems large to me. Can someone help explain why the write amplification is
>> so large? Thanks
>>
>>
>> Cheers,
>> xinxin
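
The arithmetic used in this thread can be sketched in a few lines of Python. This is only a rough illustration, not LMDB code: the function names are hypothetical, and the page-count model assumes that a one-record update rewrites exactly one page per level of the main tree, one page per level of the free-page tree (assumed height 1 here, a value not given in the thread), and one meta page. It ignores page splits, overflow pages, and any writes issued below LMDB, and it treats iostat's w/s as if each request were one page.

```python
def write_amplification(device_writes_per_sec, app_ops_per_sec):
    """Ratio of device-level writes (iostat w/s) to application-level ops/sec."""
    if app_ops_per_sec <= 0:
        raise ValueError("app_ops_per_sec must be positive")
    return device_writes_per_sec / app_ops_per_sec


def cow_pages_per_commit(main_tree_height, free_tree_height=1, meta_pages=1):
    """Rough page count for a one-record update in a copy-on-write B-tree.

    Assumption (not from the thread): every level of the main tree and of the
    free-page tree is rewritten once, plus the meta page. free_tree_height=1
    is a placeholder, not a measured value.
    """
    if main_tree_height < 1 or free_tree_height < 0 or meta_pages < 0:
        raise ValueError("heights and page counts must be non-negative, main height >= 1")
    return main_tree_height + free_tree_height + meta_pages


# Figures quoted in the thread.
fillrandsync = write_amplification(8093, 1020)   # fillrandsync: w/s vs. ops/sec
overwrite = write_amplification(4666, 521)       # overwrite with sync commits

print(f"fillrandsync write amplification: {fillrandsync:.2f}")
print(f"overwrite write amplification:    {overwrite:.2f}")

# xinxin's estimate (height 4 + meta page) versus the same estimate with
# the free-page tree included, as Leonid's list of touched pages suggests.
print("estimate without freeDB:", cow_pages_per_commit(4, free_tree_height=0))
print("estimate with freeDB:   ", cow_pages_per_commit(4))
```

Under these assumptions the model gives 5 pages without the free-page tree and 6 with it, both below the measured ratio of roughly 9 for overwrites. The sketch does not account for the remaining gap, and the thread does not settle where it comes from.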
