Hello Clint,

Thanks for your feedback. I am replying inline below.

On 06/16/2017 10:54 PM, Clint Byrum wrote:
Excerpts from John Dickinson's message of 2017-06-16 11:35:39 -0700:
On 16 Jun 2017, at 10:51, Clint Byrum wrote:

This is great work.

I'm sure you've already thought of this, but could you explain why
you've chosen not to put the small objects in the k/v store as part of
the value rather than in secondary large files?
I don't want to co-opt an answer from Alex, but I do want to point to some of 
the other background on this LOSF work.

https://wiki.openstack.org/wiki/Swift/ideas/small_files
https://wiki.openstack.org/wiki/Swift/ideas/small_files/experimentations
https://wiki.openstack.org/wiki/Swift/ideas/small_files/implementation

These are great. Thanks for sharing them, I understand a lot more now.

Look at the second link for some context to your answer, but the summary is "that 
means writing a file system, and writing a file system is really hard".

I'm not sure we were thinking the same thing.

I was more asking, why not put the content of the object into the k/v
instead of the big_file_id:offset? My thinking was that for smaller
objects, you would just return the data immediately upon reading the k/v,
rather than then needing to go find the big file and read the offset.
However, I'm painfully aware that those directly involved with the problem
have likely thought of this. However, the experiments don't seem to show
that this was attempted. Perhaps I'm zooming too far out to see the real
problem space. You can all tell me to take my spray paint can and stop
staring at the bike shed if this is just too annoying. Seriously.

Of course, one important thing is, what does one consider "small"? Seems
like there's a size where the memory footprint of storing it in the
k/v would be justifiable if reads just returned immediately from k/v
vs. needing to also go get data from a big file on disk. Perhaps that
size is too low to really matter. I was hoping that this had been
considered and there was documentation, but I don't really see it.
Right, we had considered this when we started the project: storing small objects directly in the KV. It would not be too difficult to do, but we see a few problems:

1) consistency
In the current design, we append data at the end of a "big file". When the data upload is finished, Swift writes the metadata and commits the file. This triggers an fsync(). Only then do we return. We can rely on the data being stable on disk, even if there is a power loss. Because we fallocate() space for the "big files" beforehand, we can also hope to have mostly sequential disk IO (important, as most Swift clusters use SATA disks).
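
A minimal sketch of the commit path described above, under stated assumptions: the record layout (HEADER), the function name append_object and the volume path are hypothetical and are not the actual LOSF on-disk format. Preallocation with fallocate() is left out to keep the example short.

```python
import os
import struct

# Hypothetical record header: key length and data length, both 32-bit.
# This is an illustrative layout, not the real volume format.
HEADER = struct.Struct("<II")


def append_object(volume_path, key, data):
    """Append one record to a volume file and return its starting offset.

    The function returns only after fsync(), so a caller that receives
    the offset can assume the record survives a power loss.
    """
    key_bytes = key.encode("utf-8")
    record = HEADER.pack(len(key_bytes), len(data)) + key_bytes + data
    fd = os.open(volume_path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        # The current end of the file is where this record will start.
        offset = os.lseek(fd, 0, os.SEEK_END)
        written = 0
        while written < len(record):
            written += os.write(fd, record[written:])
        # Force the data to stable storage before acknowledging.
        os.fsync(fd)
    finally:
        os.close(fd)
    return offset
```

The returned offset is what would later be stored in the KV next to the volume identifier.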

Once the object has been committed, we create an entry for it in the KV. This is done asynchronously, because synchronous writes to the KV kill performance. If we lose power, we lose the most recent KV entries. After the server is rebooted, we have to scan the end of the volumes to create the missing entries in the KV. (I will not discuss this in detail here to keep this email short, but we can discuss it in another thread, or I can post some information on the wiki.)

If we put small objects in the KV, we would need to do synchronous writes to make sure we don't lose data. Also, currently we can completely reconstruct the KV from the "big files". That would no longer be possible.
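
To illustrate why the KV can be rebuilt from the "big files", here is a rough sketch of a volume scan. It assumes a hypothetical record layout (a header with key length and data length, followed by the key and the data); the names HEADER and scan_volume are made up for this example, and a real implementation would also want checksums to detect corrupted records.

```python
import os
import struct

# Hypothetical record header: key length and data length, both 32-bit.
HEADER = struct.Struct("<II")


def scan_volume(volume_path, start_offset=0):
    """Walk the records of a volume and return {key: (offset, data_len)}.

    Pass the last offset known to the KV as start_offset to scan only
    the end of the volume after a crash, or 0 to rebuild everything.
    """
    index = {}
    size = os.path.getsize(volume_path)
    with open(volume_path, "rb") as f:
        offset = start_offset
        while offset + HEADER.size <= size:
            f.seek(offset)
            key_len, data_len = HEADER.unpack(f.read(HEADER.size))
            end = offset + HEADER.size + key_len + data_len
            if end > size:
                # Incomplete record at the tail (write interrupted): stop.
                break
            key = f.read(key_len).decode("utf-8")
            index[key] = (offset, data_len)
            offset = end
    return index
```

If the object data lived only in the KV, there would be no second copy to scan, which is the point made above.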


2) performance
On our clusters we see about 40% of physical disk IO being caused by readdir(). We want to serve directory listing requests from memory. So "small" means "the KV can fit in the page cache". We estimate that we need the size per object to be below 50 bytes, which doesn't leave much room for data.
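
The sizing argument above is simple arithmetic. The sketch below uses made-up numbers (object count and page cache budget are purely illustrative, not measurements from our clusters) to show how the per-object budget follows from the two.

```python
def max_entry_bytes(cache_bytes, object_count):
    """Largest per-object KV entry that still fits in the page cache."""
    return cache_bytes // object_count


def kv_footprint(entry_bytes, object_count):
    """Total KV size for a given per-object entry size."""
    return entry_bytes * object_count


# Illustrative assumption: 100 million objects on a disk and 5 GB of
# page cache available for that disk's KV.
objects = 100_000_000
cache = 5_000_000_000

budget = max_entry_bytes(cache, objects)  # 50 bytes per object
# Inlining even 1 KiB of data per object would exceed the cache many times over.
inline_total = kv_footprint(budget + 1024, objects)
```

With these example figures, the budget works out to 50 bytes per entry, which has to cover the key and the volume location before any object data.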

LevelDB causes write amplification, as it will regularly copy data to different files (levels) to keep keys compressed and in sorted order. If we store object data within the KV, it will be copied around multiple times as well.


Finally, it is also simpler to have only one path to handle. Beyond these issues, it would not be difficult to store data in the KV. This is something we can revisit after more tests and maybe some production experience.


Also the "writing your own filesystem" option in experiments seemed
more like a thing to do if you left the k/v stores out entirely.





__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev