On 16 Jun 2017, at 10:51, Clint Byrum wrote:

> This is great work.
>
> I'm sure you've already thought of this, but could you explain why
> you've chosen not to put the small objects in the k/v store as part of
> the value rather than in secondary large files?

I don't want to co-opt an answer from Alex, but I do want to point to some of 
the other background on this LOSF work.

https://wiki.openstack.org/wiki/Swift/ideas/small_files
https://wiki.openstack.org/wiki/Swift/ideas/small_files/experimentations
https://wiki.openstack.org/wiki/Swift/ideas/small_files/implementation

Look at the second link for some context for an answer to your question, but the 
summary is "that means writing a file system, and writing a file system is really hard".

--John



>
> Excerpts from Alexandre Lécuyer's message of 2017-06-16 15:54:08 +0200:
>> Swift stores objects on a regular filesystem (XFS is recommended), one file 
>> per object. While this works fine for medium or big objects, when you have 
>> lots of small objects you can run into issues: because of the high inode 
>> count on the object servers, the inodes can't all stay in cache, which means 
>> a lot of memory usage and extra IO operations to fetch inodes from disk.
>>
>> In the past few months, we've been working on implementing a new storage 
>> backend in Swift. It is heavily inspired by haystack[1]. In a few words, 
>> objects are stored in big files, and a key/value store provides the 
>> information needed to locate an object (object hash -> big_file_id:offset). 
>> As the mapping in the K/V consumes less memory than an inode, it is possible 
>> to keep all entries in memory, saving many IOs when locating objects. It also 
>> allows some performance improvements by limiting XFS metadata updates (e.g. 
>> almost no inode updates, as we write objects using fdatasync() instead of 
>> fsync()).
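>>
>> A minimal sketch of that write path might look like the following (assuming 
>> the plyvel LevelDB binding; the paths, volume naming and key format are 
>> purely illustrative, not the actual implementation):
>>
>>     import os
>>     import plyvel
>>
>>     # one K/V store and one "big file" (volume) per device -- illustrative layout
>>     db = plyvel.DB('/srv/node/sda/kv', create_if_missing=True)
>>     volume_fd = os.open('/srv/node/sda/volume_0001',
>>                         os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
>>
>>     def put_object(obj_hash, data):
>>         # append the object to the big file and remember where it landed
>>         offset = os.lseek(volume_fd, 0, os.SEEK_END)
>>         os.write(volume_fd, data)
>>         # fdatasync() flushes the data without forcing an inode (mtime) update
>>         os.fdatasync(volume_fd)
>>         # object hash -> big_file_id:offset
>>         db.put(obj_hash, b'volume_0001:%d' % offset)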
>>
>> One of the questions raised during discussions about this design is: do we 
>> want one K/V store per device, or one K/V store per Swift partition 
>> (= multiple K/Vs per device)? The concern is the failure domain: if the 
>> device's only K/V gets corrupted, the whole device must be reconstructed. 
>> Memory usage is a major factor in this decision, so we ran some benchmarks.
>>
>> The key-value store is implemented on top of LevelDB.
>> Given a single disk holding 20 million files (each file being either one 
>> object replica or, if using EC, one fragment), I have tested three cases:
>>    - single KV for the whole disk
>>    - one KV per partition, with 100 partitions per disk
>>    - one KV per partition, with 1000 partitions per disk
>>
>> Single KV for the whole disk:
>>    - DB size: 750 MB
>>    - bytes per object: 38
>>
>> One KV per partition:
>> Assuming:
>>    - 100 partitions on the disk (=> 100 KVs)
>>    - 16-bit part power (=> all keys in a given KV share the same 16-bit 
>> prefix)
>>
>>    - 7916 KB per KV, total DB size: 773 MB
>>    - bytes per object: 41
>>
>> One KV per partition:
>> Assuming:
>>    - 1000 partitions on the disk (=> 1000 KVs)
>>    - 16-bit part power (=> all keys in a given KV share the same 16-bit 
>> prefix)
>>
>>    - 1388 KB per KV, total DB size: 1355 MB
>>    - bytes per object: 71
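>>
>> (To illustrate the "same 16-bit prefix" point: Swift derives the partition 
>> from the top bits of the object's md5 hash, so with a 16-bit part power every 
>> key in a given per-partition KV starts with the same two bytes. A rough 
>> sketch, not the exact ring code:)
>>
>>     PART_POWER = 16
>>
>>     def partition(obj_hash_hex):
>>         # top PART_POWER bits of the object hash select the partition, so all
>>         # hashes in one partition -- and thus in one per-partition KV --
>>         # share the same 16-bit prefix
>>         return int(obj_hash_hex[:8], 16) >> (32 - PART_POWER)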
>>
>>
>> A typical server we use for Swift clusters has 36 drives, which gives us:
>> - Single KV : 26 GB
>> - Split KV, 100 partitions : 28 GB (+7%)
>> - Split KV, 1000 partitions : 48 GB (+85%)
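>>
>> (These per-server figures are just the per-disk DB size times 36 drives: 
>> 36 x 750 MB ~= 26 GB, 36 x 773 MB ~= 28 GB, 36 x 1355 MB ~= 48 GB.)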
>>
>> So, splitting seems reasonable if you don't have too many partitions.
>>
>> Same test, with 10 million files instead of 20 million:
>>
>> - Single KV : 13 GB
>> - Split KV, 100 partitions : 18 GB (+38%)
>> - Split KV, 1000 partitions : 24 GB (+85%)
>>
>>
>> Finally, if we run a full compaction on the DB after the test, we get the 
>> same memory usage in all cases, about 32 bytes per object.
>>
>> We have not run enough tests to know what would happen in production. LevelDB 
>> does trigger compaction automatically on parts of the DB, but with continuous 
>> change we would probably never reach that smallest possible size.
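>>
>> For reference, forcing a full compaction with the plyvel binding (assuming 
>> that binding is used) is a single call over the whole key range:
>>
>>     db.compact_range(start=None, stop=None)   # compact the entire DB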
>>
>>
>> Beyond the size issue, there are other things to consider:
>> File descriptor limits: LevelDB seems to keep at least 4 file descriptors 
>> open during operation.
>>
>> Having one KV per partition also means you have to move entries between KVs 
>> when you change the part power (if we want to support that).
>>
>> A compromise may be to split the KVs on a small prefix of the object's hash, 
>> independent of Swift's configuration.
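>>
>> A rough sketch of what that compromise could look like (the prefix length 
>> here is hypothetical):
>>
>>     PREFIX_BITS = 8   # 256 KVs per disk, fixed regardless of part power
>>
>>     def kv_index(obj_hash_hex):
>>         # route each object to a KV based on the first byte of its hash,
>>         # independent of Swift's partition count / part power
>>         return int(obj_hash_hex[:2], 16)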
>>
>> As you can see, we're still thinking about this. Any ideas are welcome!
>> We will keep you updated with more "real world" testing. Among other tests, 
>> we plan to check how resilient the DB is in case of a power loss.
>>
>