This is great work.

I'm sure you've already thought of this, but could you explain why you've
chosen not to store small objects directly in the K/V store, as part of the
value, rather than in the secondary large files?

Excerpts from Alexandre Lécuyer's message of 2017-06-16 15:54:08 +0200:
> Swift stores objects on a regular filesystem (XFS is recommended), one file
> per object. While this works fine for medium or big objects, when you have
> lots of small objects you can run into issues: because of the high inode
> count on the object servers, the inodes can't all stay in cache, which
> implies a lot of memory usage and IO operations to fetch inodes from disk.
> 
> In the past few months, we've been working on implementing a new storage
> backend in Swift. It is highly inspired by haystack[1]. In a few words,
> objects are stored in big files, and a key/value store provides the
> information needed to locate an object (object hash -> big_file_id:offset).
> As the mapping in the K/V store consumes less memory than an inode, it is
> possible to keep all entries in memory, saving many IOs when locating
> objects. It also allows some performance improvements by limiting XFS
> metadata updates (e.g. almost no inode updates, as we write objects using
> fdatasync() instead of fsync()).
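
For readers following along, here is a rough sketch of how I understand the
index, just to make the layout concrete; the plyvel binding, the fixed
16-byte value encoding, and the path are my assumptions, not the actual
implementation:

    import struct
    import plyvel  # LevelDB bindings; any K/V client would do

    # Hypothetical value encoding: 8-byte big-file id + 8-byte byte offset.
    # A compact per-object record like this replaces a cached inode.
    def pack_location(big_file_id, offset):
        return struct.pack('>QQ', big_file_id, offset)

    def unpack_location(value):
        return struct.unpack('>QQ', value)

    db = plyvel.DB('/srv/node/sda/objects.db', create_if_missing=True)

    # Index an object appended to big file 42 at byte offset 123456.
    object_hash = bytes.fromhex('d41d8cd98f00b204e9800998ecf8427e')
    db.put(object_hash, pack_location(42, 123456))

    # Lookup: an in-memory K/V read instead of fetching an inode from disk.
    big_file_id, offset = unpack_location(db.get(object_hash))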
> 
> One of the questions raised during discussions about this design is: do we
> want one K/V store per device, or one K/V store per Swift partition (i.e.
> multiple K/V stores per device)? The concern is the failure domain: if the
> only K/V store gets corrupted, the whole device must be reconstructed.
> Memory usage is a major point in making a decision, so we ran some
> benchmarks.
> 
> The key-value store is implemented over LevelDB.
> Given a single disk with 20 million files (each could be either one object
> replica or one fragment, if using EC), I have tested three cases:
> 
>    - single KV for the whole disk
>    - one KV per partition, with 100 partitions per disk
>    - one KV per partition, with 1000 partitions per disk
> 
> Single KV for the disk:
>    - DB size: 750 MB
>    - bytes per object: 38
> 
> One KV per partition, assuming:
>    - 100 partitions on the disk (=> 100 KVs)
>    - 16-bit part power (=> all keys in a given KV will have the same
>      16-bit prefix)
> 
>    - 7916 KB per KV, total DB size: 773 MB
>    - bytes per object: 41
> 
> One KV per partition, assuming:
>    - 1000 partitions on the disk (=> 1000 KVs)
>    - 16-bit part power (=> all keys in a given KV will have the same
>      16-bit prefix)
> 
>    - 1388 KB per KV, total DB size: 1355 MB
>    - bytes per object: 71
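
A side note on the "same 16-bit prefix" point, for readers who don't have
the ring code in their head: as I understand the ring, with a 16-bit part
power the partition is the top 16 bits of the object's path hash, so a
per-partition KV only ever sees keys sharing that prefix. A rough
illustration (the function name and the simplified hashing are mine):

    from hashlib import md5
    import struct

    PART_POWER = 16  # 2**16 partitions in the ring

    def partition_for(hashed_path):
        # The partition is the top PART_POWER bits of the path's MD5 digest,
        # so every object hash in partition P starts with the 16-bit prefix P.
        digest = md5(hashed_path).digest()
        return struct.unpack_from('>I', digest)[0] >> (32 - PART_POWER)

    print(partition_for(b'/AUTH_test/container/object'))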
>    
> 
> A typical server we use for Swift clusters has 36 drives, which gives us:
> - Single KV: 26 GB
> - Split KV, 100 partitions: 28 GB (+7%)
> - Split KV, 1000 partitions: 48 GB (+85%)
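
Quick sanity check on my side, to make sure I'm reading the per-disk numbers
right: the single-KV figure is just the per-disk DB size scaled by the drive
count, 36 x 750 MB ~= 26.4 GB, which matches the 26 GB above.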
> 
> So, splitting seems reasonable if you don't have too many partitions.
> 
> Same test, with 10 million files instead of 20 million:
> 
> - Single KV: 13 GB
> - Split KV, 100 partitions: 18 GB (+38%)
> - Split KV, 1000 partitions: 24 GB (+85%)
> 
> 
> Finally, if we run a full compaction on the DB after the test, we get the
> same memory usage in all cases, about 32 bytes per object.
> 
> We have not run enough tests to know what would happen in production.
> LevelDB does trigger compaction automatically on parts of the DB, but with
> continuous change we probably would not reach the smallest possible size.
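
I assume the "full compaction" measurement is something along these lines;
plyvel again, and the path is made up, so this is only a sketch of how one
could reproduce it, not necessarily what you did:

    import os
    import plyvel

    db_path = '/srv/node/sda/objects.db'  # hypothetical path
    db = plyvel.DB(db_path)

    # Compact the whole key range: LevelDB rewrites its SST files into
    # their most compact form, which is where a floor of roughly
    # 32 bytes/object would come from.
    db.compact_range(start=None, stop=None)
    db.close()

    # On-disk size after compaction: sum of all files in the DB directory.
    size = sum(os.path.getsize(os.path.join(db_path, name))
               for name in os.listdir(db_path))
    print('%d bytes, %.1f bytes/object at 20M objects' % (size, size / 20e6))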
> 
> 
> Beyond the size issue, there are other things to consider:
> File descriptor limits: LevelDB seems to keep at least 4 file descriptors
> open during operation.
> 
> Having one KV per partition also means you have to move entries between KVs
> when you change the part power (if we want to support that).
> 
> A compromise may be to split KVs on a small prefix of the object's hash,
> independent of Swift's configuration.
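
That compromise sounds reasonable to me. For what it's worth, this is the
kind of fixed fan-out I picture, decoupled from the part power; the fan-out
of 64 and the names are just my assumptions:

    from hashlib import md5

    NUM_KVS = 64  # hypothetical fixed fan-out per disk, unrelated to part power

    def kv_index(object_hash_hex):
        # Choose a KV from a small, fixed prefix of the object hash, so a
        # part power change never requires moving entries between KVs.
        return int(object_hash_hex[:2], 16) % NUM_KVS

    h = md5(b'/AUTH_test/container/object').hexdigest()
    print(kv_index(h))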
> 
> As you can see, we're still thinking about this. Any ideas are welcome!
> We will keep you updated with more "real world" testing. Among other tests,
> we plan to check how resilient the DB is in case of a power loss.
> 
