Re: [ceph-users] WAL/DB size

Mark Nelson Thu, 15 Aug 2019 02:18:35 -0700

Hi Folks,

The basic idea behind the WAL is that for every DB write transaction you first write it into an in-memory buffer and to a region on disk. RocksDB is typically set up to have multiple WAL buffers, and when one or more fills up, it will start flushing the data to L0 while new writes are written to the next buffer. If RocksDB can't flush data fast enough, it will throttle write throughput down so that hopefully you don't fill all of the buffers up and stall before a flush completes. The combined total size/number of buffers governs both how much disk space you need for the WAL and how much RAM is needed to store incoming IO that hasn't finished flushing into the DB.

There are various tradeoffs when adjusting the size, number, and behavior of the WAL. On one hand there's an advantage to having small buffers to favor frequent swift flush events and hopefully keep overall memory usage low and CPU overhead of key comparisons low. On the other hand, having large WAL buffers means you have more runway both in terms of being able to absorb longer L0 compaction events but also potentially in terms of being able to avoid writing pglog entries to L0 entirely if a tombstone lands in the same WAL buffer as the initial write. We've seen evidence that write amplification is (sometimes much) lower with bigger WAL buffers and we think this is a big part of the reason why.



Right now our default WAL settings for RocksDB are:


max_write_buffer_number=4

min_write_buffer_number_to_merge=1

write_buffer_size=268435456

which means we will store up to 4 256MB buffers and start flushing as soon as 1 fills up. An alternate strategy could be to use something like 16 64MB buffers and set min_write_buffer_number_to_merge to something like 4. Potentially that might provide slightly more fine-grained control and may also be advantageous with a larger number of column families, but we haven't seen evidence yet that splitting the buffers into more, smaller segments definitely improves things. Probably the bigger takeaway is that you can't simply make the WAL huge to give yourself extra runway for writes unless you are also willing to eat the RAM cost of storing all of that data in memory as well. That's one of the reasons why we regularly tell people that 1-2GB is enough for the WAL. With a target OSD memory of 4GB, (up to) 1GB for the WAL is already pushing it. Luckily in most cases it doesn't actually use the full 1GB though. RocksDB will throttle before you get to that point, so in reality the WAL is more likely using something like 0-512MB of disk/RAM, with 2-3 extra buffers of capacity in case things get hairy.
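
As a rough sketch of that arithmetic (the helper name wal_budget is made up for illustration, and this ignores RocksDB's throttling, so it only gives the upper bounds rather than what is actually in use at any moment):

```python
MIB = 1024 * 1024

def wal_budget(write_buffer_size, max_write_buffer_number,
               min_write_buffer_number_to_merge):
    """Return (worst-case bytes held in buffers, bytes buffered before flushing starts).

    Hypothetical helper: the parameter names mirror the RocksDB options,
    but the function itself is not part of RocksDB or Ceph.
    """
    worst_case = write_buffer_size * max_write_buffer_number
    flush_trigger = write_buffer_size * min_write_buffer_number_to_merge
    return worst_case, flush_trigger

# Defaults quoted above: 4 x 256MB buffers, flush as soon as 1 is full.
default = wal_budget(268435456, 4, 1)

# Alternate strategy mentioned above: 16 x 64MB buffers, merge 4 before flushing.
alternate = wal_budget(64 * MIB, 16, 4)

for name, (worst, trigger) in (("4 x 256MB", default), ("16 x 64MB", alternate)):
    print(f"{name}: worst case {worst // MIB} MiB, flush starts at {trigger // MIB} MiB")
```

Both layouts come out to the same 1GB ceiling and the same 256MB flush threshold, so under these assumptions the alternate layout changes only the granularity, not the overall disk/RAM budget.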



Mark


On 8/15/19 1:59 AM, Janne Johansson wrote:

On Thu, 15 Aug 2019 at 00:16, Anthony D'Atri wrote:
    Good points in both posts, but I think there’s still some unclarity.


...

    We’ve seen good explanations on the list of why only specific DB
    sizes, say 30GB, are actually used _for the DB_.
    If the WAL goes along with the DB, shouldn’t we also explicitly
    determine an appropriate size N for the WAL, and make the
    partition (30+N) GB?
    If so, how do we derive N?  Or is it a constant?

    Filestore was so much simpler, 10GB set+forget for the journal. 
    Not that I miss XFS, mind you.
But we got a simple handwaving-best-effort-guesstimate that went "WAL 1GB is fine, yes." so there you have an N you can use for the
30+N or 60+N sizings.
Can't see how that N needs more science than the filestore N=10G you showed. Not that I think journal=10G was wrong or anything.
--
May the most significant bit of your life be positive.

_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
