RE: large write amplification

Shu, Xinxin Wed, 06 May 2015 23:42:03 -0700

For the overwrite case, I checked the height of the B-tree of my LMDB database.
The height is 4, so for a "one byte" update there should be 4 page updates plus
one meta page update, and write amplification should be 5 rather than ~9. Let me
know if I missed something.
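
For reference, the ratios discussed in this thread can be recomputed from the
reported figures. This is only a sketch: the function name is illustrative, and
it assumes each iostat write request (w/s) corresponds to one page write, which
may not hold if the kernel merges or splits requests.

```python
def write_amplification(device_writes_per_sec, app_ops_per_sec):
    """Ratio of device-level writes (iostat w/s) to application-level ops/sec."""
    if app_ops_per_sec <= 0:
        raise ValueError("app_ops_per_sec must be positive")
    return device_writes_per_sec / app_ops_per_sec


# Figures reported in this thread.
overwrite_wa = write_amplification(4666, 521)      # overwrite test
fillrandsync_wa = write_amplification(8093, 1020)  # fillrandsync test

# Expectation from the message above: 4 tree pages (height 4) + 1 meta page.
expected_wa = 4 + 1

print(round(overwrite_wa, 2), round(fillrandsync_wa, 2), expected_wa)
```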


By the way, how can I get the degree of the B-tree of an LMDB database?
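
One possible way to approach this, as a sketch only: LMDB's stat structure
(MDB_stat in the C API, with ms_depth, ms_branch_pages and ms_leaf_pages)
reports page counts rather than a degree, so an average fan-out can be
estimated from them. The function name is hypothetical, the dictionary keys are
assumed to mirror what a Python binding's stat call returns, and the numbers in
the example are made up.

```python
def average_fanout(stat):
    """Estimate the average number of children per branch page.

    Every page except the root is a child of exactly one branch page, so
    children = branch_pages + leaf_pages - 1.
    """
    branch = stat["branch_pages"]
    leaf = stat["leaf_pages"]
    if branch == 0:
        return None  # a single leaf page (or empty tree) has no branch pages
    return (branch + leaf - 1) / branch


# Hypothetical stat values, not taken from any real database.
example_stat = {"psize": 4096, "depth": 3, "branch_pages": 101, "leaf_pages": 10000}
print(average_fanout(example_stat))
```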

Cheers,
xinxin

-----Original Message-----
From: Леонид Юрьев [mailto:[email protected]] 
Sent: Tuesday, May 05, 2015 6:16 PM
To: Shu, Xinxin
Cc: [email protected]
Subject: Re: large write amplification

Hm, ANY change needs a btree update.

Let's take an item key=K, data=A.
Then overwrite A with B, so now key=K, data=B.

This is simply a "one byte" change, but several db-pages need to be cloned and
updated:
- the page that contains data=B and the records around it.
- the b-tree page that holds a pointer/reference to the page that contains
data=B and the records around it.
- all pages on the leaf-to-root path in the b-tree, related to the new b-tree
page that holds a pointer/reference to the page that contains data=B and the
records around it.
-  ...
- new root pages of mainDB and freeDB.
- a pointer to the "new root" in the meta-page, that lay in the house that Jack built ;)
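
The list above can be turned into a rough page count. This is a simplified
model, not LMDB's actual accounting: the function name is hypothetical, the
freeDB depth used below is an assumed value, and page splits, overflow pages
and multi-page freeDB updates are ignored.

```python
def pages_written_per_commit(tree_depths, meta_pages=1):
    """Count pages a copy-on-write commit writes in this simplified model.

    tree_depths: depth of each B-tree whose leaf-to-root path is cloned
    (for example mainDB and freeDB). One meta page is added by default.
    """
    if any(depth < 1 for depth in tree_depths):
        raise ValueError("every tree depth must be at least 1")
    return sum(tree_depths) + meta_pages


# mainDB of height 4 only, plus the meta page.
only_main = pages_written_per_commit([4])

# Same, plus a freeDB path; depth 2 is an assumption for illustration.
with_free_db = pages_written_per_commit([4, 2])

print(only_main, with_free_db)
```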

So, by design LMDB is optimized for high-load reading, but not for writes.

Leonid.

2015-05-05 10:26 GMT+03:00 Shu, Xinxin <[email protected]>:
> Hi Leonid,
>
> Thanks for your reply. I observed another scenario: I also tested
> "overwrite" mode. I slightly modified the source code to change the default
> behavior (set dbflags_ = SYNC, so data is flushed to disk once a transaction
> is committed) and also collected iostat data. The overwrite rate is ~521
> ops/sec, but iostat shows that w/s is ~4666, so the write amplification
> is ~9. To my understanding, overwriting an existing value does not adjust
> the btree, so why is the write amplification so large? Could you help
> explain? Thanks
>
> Cheers,
> xinxin
>
> -----Original Message-----
> From: Леонид Юрьев [mailto:[email protected]]
> Sent: Monday, May 04, 2015 6:59 PM
> To: Shu, Xinxin
> Cc: [email protected]
> Subject: Re: large write amplification
>
> Hi, Xinxin.
>
> I will try to answer briefly, without details:
>
> - So that readers are never blocked by a writer, LMDB provides a snapshot of
> data, indexes and directory for each completed transaction.
>
> - Most db-pages (those not changed by a particular
> transaction) are "shared" between such snapshots. But any change to the data
> itself, and its reflection in the btree indexes (including the particular
> table, free-db, main-db and so forth), requires new pages to be used and
> written to disk.
>
> - In a large db a small "one-byte" change may make a lot of db-pages "dirty"
> (usually 4K each). For example, one add/del/mod operation in an LDAP db a
> few GB in size requires about 50-100 page-level I/O operations.
>
> Leonid.
>
> P.S.
> For high-load use cases I made a few changes in our fork of OpenLDAP/LMDB.
> One of these features we called "LIFO reclaiming".
> It gives us a 10-50 times performance boost, especially by exploiting the
> write-back cache of the storage subsystem.
> We now use it in our production (telco) environment.
> But currently it is not safe in all cases, see
> https://github.com/ReOpen/ReOpenLDAP/issues/2 and
> https://github.com/ReOpen/ReOpenLDAP/issues/1.
>
> 2015-05-04 5:31 GMT+03:00 Shu, Xinxin <[email protected]>:
>> Hi list,
>>
>> Recently I ran micro tests of LMDB on a DC3700 (200GB) using the bench
>> code at https://github.com/hyc/leveldb/tree/benches . I tested fillrandsync
>> mode and collected iostat data, and found that write amplification is
>> large. For the fillrandsync case:
>>
>> IOPS : 1020 ops/sec
>>
>> iostat data shows that w/s on that SSD is 8093, avgqu-sz is ~1, and
>> await time is about 0.16 ms, so the write amplification is ~8, which
>> is large to me. Can someone help explain why the write amplification is
>> so large? Thanks
>>
>>
>> Cheers,
>> xinxin
>>
>>
