This is great work. I'm sure you've already thought of this, but could you explain why you've chosen not to put the small objects in the k/v store as part of the value rather than in secondary large files?
Excerpts from Alexandre Lécuyer's message of 2017-06-16 15:54:08 +0200:
> Swift stores objects on a regular filesystem (XFS is recommended), one file
> per object. While that works fine for medium or big objects, when you have
> lots of small objects you can run into issues: because of the high count of
> inodes on the object servers, they can’t stay in cache, implying a lot of
> memory usage and IO operations to fetch inodes from disk.
>
> In the past few months, we’ve been working on implementing a new storage
> backend in Swift. It is highly inspired by haystack. In a few words,
> objects are stored in big files, and a key/value store provides the
> information needed to locate an object (object hash -> big_file_id:offset).
> As the mapping in the K/V consumes less memory than an inode, it is
> possible to keep all entries in memory, saving many IOs when locating
> objects. It also allows some performance improvements by limiting the XFS
> metadata updates (e.g. almost no inode updates, as we write objects using
> fdatasync() instead of fsync()).
>
> One of the questions raised during discussions about this design is: do we
> want one K/V store per device, or one K/V store per Swift partition (=
> multiple K/Vs per device)? The concern is the failure domain: if the only
> K/V gets corrupted, the whole device must be reconstructed. Memory usage
> is a major point in making a decision, so we ran some benchmarks.
>
> The key/value store is implemented on top of LevelDB.
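If I follow the design, the write path looks roughly like the sketch below. This is plain Python with an in-memory dict standing in for LevelDB; the `VolumeStore` name and record layout are my invention, not from your patch. It also shows where the fdatasync()-instead-of-fsync() optimization you mention fits:

```python
import hashlib
import os
import tempfile

class VolumeStore:
    """Toy haystack-style backend: objects are appended to one big volume
    file; an in-memory dict (stand-in for the LevelDB K/V) maps
    object_hash -> (offset, size) inside the volume."""

    def __init__(self, path):
        self.fd = os.open(path, os.O_CREAT | os.O_RDWR | os.O_APPEND, 0o644)
        self.index = {}  # object_hash -> (offset, size)

    def put(self, name, data):
        # an MD5 of the object path stands in for Swift's object hash
        key = hashlib.md5(name.encode()).hexdigest()
        offset = os.lseek(self.fd, 0, os.SEEK_END)
        os.write(self.fd, data)
        # fdatasync() flushes the data but skips the inode (metadata) flush
        # that fsync() would force -- the optimization described above
        os.fdatasync(self.fd)
        self.index[key] = (offset, len(data))
        return key

    def get(self, key):
        offset, size = self.index[key]
        return os.pread(self.fd, size, offset)

# quick demo on a temporary volume file
volume = os.path.join(tempfile.mkdtemp(), "volume_0001")
store = VolumeStore(volume)
k1 = store.put("AUTH_test/c/small-1", b"hello")
k2 = store.put("AUTH_test/c/small-2", b"world!")
roundtrip = (store.get(k1), store.get(k2))
```

The nice property being that reads cost a single pread() once the index entry is in memory, with no inode lookup on the hot path.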
> Given a single disk with 20 million files (each could be either one object
> replica or one fragment, if using EC), I have tested three cases:
>
> - a single KV for the whole disk
> - one KV per partition, with 100 partitions per disk
> - one KV per partition, with 1000 partitions per disk
>
> Single KV for the disk:
> - DB size: 750 MB
> - bytes per object: 38
>
> One KV per partition, assuming:
> - 100 partitions on the disk (=> 100 KVs)
> - 16-bit part power (=> all keys in a given KV share the same 16-bit prefix)
>
> - 7916 KB per KV, total DB size: 773 MB
> - bytes per object: 41
>
> One KV per partition, assuming:
> - 1000 partitions on the disk (=> 1000 KVs)
> - 16-bit part power (=> all keys in a given KV share the same 16-bit prefix)
>
> - 1388 KB per KV, total DB size: 1355 MB
> - bytes per object: 71
>
> A typical server we use for Swift clusters has 36 drives, which gives us:
> - single KV: 26 GB
> - split KV, 100 partitions: 28 GB (+7%)
> - split KV, 1000 partitions: 48 GB (+85%)
>
> So splitting seems reasonable if you don't have too many partitions.
>
> The same test with 10 million files instead of 20 million:
> - single KV: 13 GB
> - split KV, 100 partitions: 18 GB (+38%)
> - split KV, 1000 partitions: 24 GB (+85%)
>
> Finally, if we run a full compaction on the DB after the test, we get the
> same memory usage in all cases: about 32 bytes per object.
>
> We have not made enough tests to know what would happen in production.
> LevelDB does trigger compaction automatically on parts of the DB, but
> continuous change means we probably would not reach the smallest possible
> size.
>
> Beyond the size issue, there are other things to consider.
> File descriptor limits: LevelDB seems to keep at least 4 file descriptors
> open during operation.
>
> Having one KV per partition also means you have to move entries between KVs
> when you change the part power.
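For what it's worth, the per-server totals reproduce nicely if the quoted MB/GB figures are read as MiB/GiB (an assumption on my part). A quick sanity check of the arithmetic:

```python
MiB, GiB = 2**20, 2**30
DRIVES = 36  # drives per server, as stated above

# per-disk DB sizes from the 20M-object benchmark, taken as MiB
db_size_mib = {
    "single KV": 750,
    "split KV, 100 partitions": 773,
    "split KV, 1000 partitions": 1355,
}

# scale each per-disk size up to a whole 36-drive server
per_server_gib = {name: size * MiB * DRIVES / GiB
                  for name, size in db_size_mib.items()}
# e.g. 750 MiB * 36 drives ~= 26.4 GiB, matching the quoted "26 GB"

# relative overhead of the heaviest split versus the single KV
overhead = (per_server_gib["split KV, 1000 partitions"]
            / per_server_gib["single KV"] - 1)
```

The ~81% overhead computed from the raw sizes lands close to the quoted +85%, which was presumably derived from the rounded GB figures.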
> (if we want to support that).
>
> A compromise may be to split KVs on a small prefix of the object's hash,
> independent of Swift's configuration.
>
> As you can see, we're still thinking about this. Any ideas are welcome!
> We will keep you updated about more "real world" testing. Among the tests,
> we plan to check how resilient the DB is in case of a power loss.

__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
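Regarding the hash-prefix compromise mentioned in the quoted message: if I read it right, the idea is something like the sketch below. The 4-bit prefix width and all names here are my invention; dicts stand in for LevelDB instances.

```python
import hashlib

NUM_SHARDS = 16  # hypothetical: one KV per 4-bit hash prefix

def kv_for(object_hash, kvs):
    """Pick the KV holding this object from a fixed prefix of its hash.

    Because the shard is derived from the hash alone (not from Swift's
    partition), changing the part power never moves entries between KVs."""
    return kvs[int(object_hash[0], 16) % NUM_SHARDS]

kvs = [{} for _ in range(NUM_SHARDS)]
h = hashlib.md5(b"AUTH_test/c/o").hexdigest()
# record where the object lives: (big_file_id, offset), per the design above
kv_for(h, kvs)[h] = ("volume_0007", 4096)
```

That would cap the failure domain at 1/16th of a device while keeping the shard count independent of the ring configuration, at the cost of shards no longer lining up with partitions for replication purposes.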