Hello John,

Thanks for your comments! Replying inline

On 06/16/2017 07:06 PM, John Dickinson wrote:
Alex, this is fantastic work and great info. Thanks for sharing it.

Additional comments inline.

On 16 Jun 2017, at 6:54, Alexandre Lécuyer wrote:

Swift stores objects on a regular filesystem (XFS is recommended), one file per 
object. While it works fine for medium or big objects, when you have lots of 
small objects you can run into issues: because of the high count of inodes on 
the object servers, they can’t stay in cache, implying lot of memory usage and 
IO operations to fetch inodes from disk.

In the past few months, we’ve been working on implementing a new storage backend 
in Swift. It is highly inspired by haystack[1]. In a few words, objects are stored 
in big files, and a Key/Value store provides information to locate an object 
(object hash -> big_file_id:offset). As the mapping in the K/V consumes less 
memory than an inode, it is possible to keep all entries in memory, saving many IO 
to locate the object. It also allows some performance improvements by limiting the 
XFS meta updates (e.g.: almost no inode updates as we write objects by using 
fdatasync() instead of fsync())

One of the questions that was raised during discussions about this design is: 
do we want one K/V store per device, or one K/V store per Swift partition (= 
multiple K/V per device). The concern was about failure domain. If the only K/V 
gets corrupted, the whole device must be reconstructed. Memory usage is a major 
point in making a decision, so we did some benchmark.

The key-value store is implemented over LevelDB.
Given a single disk with 20 million files (could be either one object replica 
or one fragment, if using EC)

I have tested three cases :
   - single KV for the whole disk
   - one KV per partition, with 100 partitions per disk
   - one KV per partition, with 1000 partitions per disk

Single KV for the disk :
   - DB size: 750 MB
   - bytes per object: 38

One KV per partition :
Assuming :
   - 100 partitions on the disk (=> 100 KV)
   - 16 bits part power (=> all keys in a given KV will have the same 16 bit 
prefix)

   - 7916 KB per KV, total DB size: 773 MB
   - bytes per object: 41

One KV per partition :
Assuming :
   - 1000 partitions on the disk (=> 1000 KV)
   - 16 bits part power (=> all keys in a given KV will have the same 16 bit 
prefix)

   - 1388 KB per KV, total DB size: 1355 MB total
   - bytes per object: 71

A typical server we use for swift clusters has 36 drives, which gives us :
- Single KV : 26 GB
- Split KV, 100 partitions : 28 GB (+7%)
- Split KV, 1000 partitions : 48 GB (+85%)

So, splitting seems reasonable if you don't have too many partitions.

Same test, with 10 million files instead of 20

- Single KV : 13 GB
- Split KV, 100 partitions : 18 GB (+38%)
- Split KV, 1000 partitions : 24 GB (+85%)


Finally, if we run a full compaction on the DB after the test, you get the
same memory usage in all cases, about 32 bytes per object.

We have not made enough tests to know what would happen in production. LevelDB
does trigger compaction automatically on parts of the DB, but continuous change
means we probably would not reach the smallest possible size.
This is likely a very good assumption (that the KV will continuously change and 
never get to minimum size).

My initial instinct is to go for one KV per drive.

One per partition does sound nice, but it is more sensitive to proper cluster 
configuration and deployment. For example, if an operator were to deploy a 
relatively small cluster but have a part power that's too big for the capacity, 
the KV strategy would end up with many thousands of mostly-empty partitions 
(imagine a 5-node cluster, 60 drives with a part power of 18 -- you're looking 
at more than 13k parts per drive per storage policy). Going for one KV per 
whole drive means that poor ring settings won't impact this area of storage as 
much.
That is also what we think. We will do more testing to confirm that one K/V per disk is stable with many objects under load, and if it does not corrupt when power outages occur. (we will have to recover a little data, but not rebuild the whole K/V).




Beyond the size issue, there are other things to consider :
File descriptors limits : LevelDB seems to keep at least 4 file descriptors 
open during operation.

Having one KV per partition also means you have to move entries between KVs 
when you change the part power. (if we want to support that)
Yes, let's support that (in general)! But doing on KV per drive means it 
already works for this LOSF work.

A compromise may be to split KVs on a small prefix of the object's hash, 
independent of swift's configuration.
This is an interesting idea to explore. It will allow for smaller individual KV 
stores without being as sensitive to the ring parameters.

As you can see we're still thinking about this. Any ideas are welcome !
We will keep you updated about more "real world" testing. Among the tests we 
plan to check how resilient the DB is in case of a power loss.
I'd also be very interested in other tests around concurrent access to the KV 
store. If we've only got one per whole drive, how many concurrent requests can 
be served? If it's a small number, that may be a good incentive to explore that 
other options (either one per partition or one per some additional hash prefix).
So far I have run tests with only about 10 concurrent clients uploading through the openstack API. With random object names, and then with "pre-computed" names to hit a single partition. It works fine, but the concurrency is quickly limited by the disk (fsync() on the "big files" after the object has been written)

I will run tests directly against the RPC service so I can stress the KV more.




Thanks for sharing the ongoing work! I'm really excited about seeing this in 
Swift.

--
Alex



[1]https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Beaver.pdf


__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

Reply via email to