Excerpts from Alexandre Lécuyer's message of 2017-06-19 11:36:15 +0200:
> Hello Clint,
>
> Thanks for your feedback, replying in the email inline.
>
> On 06/16/2017 10:54 PM, Clint Byrum wrote:
> > Excerpts from John Dickinson's message of 2017-06-16 11:35:39 -0700:
> >> On 16 Jun 2017, at 10:51, Clint Byrum wrote:
> >>
> >>> This is great work.
> >>>
> >>> I'm sure you've already thought of this, but could you explain why
> >>> you've chosen not to put the small objects in the k/v store as part of
> >>> the value rather than in secondary large files?
> >> I don't want to co-opt an answer from Alex, but I do want to point to some
> >> of the other background on this LOSF work.
> >>
> >> https://wiki.openstack.org/wiki/Swift/ideas/small_files
> >> https://wiki.openstack.org/wiki/Swift/ideas/small_files/experimentations
> >> https://wiki.openstack.org/wiki/Swift/ideas/small_files/implementation
> >>
> > These are great. Thanks for sharing them, I understand a lot more now.
> >
> >> Look at the second link for some context to your answer, but the summary
> >> is "that means writing a file system, and writing a file system is really
> >> hard".
> >>
> > I'm not sure we were thinking the same thing.
> >
> > I was more asking, why not put the content of the object into the k/v
> > instead of the big_file_id:offset? My thinking was that for smaller
> > objects, you would just return the data immediately upon reading the k/v,
> > rather than then needing to go find the big file and read the offset.
> > However, I'm painfully aware that those directly involved with the problem
> > have likely thought of this. However, the experiments don't seem to show
> > that this was attempted. Perhaps I'm zooming too far out to see the real
> > problem space. You can all tell me to take my spray paint can and stop
> > staring at the bike shed if this is just too annoying. Seriously.
> >
> > Of course, one important thing is, what does one consider "small"?
> > Seems like there's a size where the memory footprint of storing it in the
> > k/v would be justifiable if reads just returned immediately from k/v
> > vs. needing to also go get data from a big file on disk. Perhaps that
> > size is too low to really matter. I was hoping that this had been
> > considered and there was documentation, but I don't really see it.
> Right, we had considered this when we started the project: storing
> small objects directly in the KV. It would not be too difficult to do,
> but we see a few problems:
>
> 1) consistency
> In the current design, we append data at the end of a "big file". When
> the data upload is finished, swift writes the metadata and commits the
> file. This triggers an fsync(). Only then do we return. We can rely on
> the data being stable on disk, even if there is a power loss. Because
> we fallocate() space for the "big files" beforehand, we can also hope to
> have mostly sequential disk IO.
> (Important as most swift clusters use SATA disks).
>
> Once the object has been committed, we create an entry for it in the KV.
> This is done asynchronously, because synchronous writes on the KV kill
> performance. If we lose power, we lose the latest data. After the
> server is rebooted, we have to scan the end of volumes to create missing
> entries in the KV. (I will not discuss this in detail in this email to
> keep this short, but we can discuss it in another thread, or I can post
> some information on the wiki).
>
> If we put small objects in the KV, we would need to do synchronous
> writes to make sure we don't lose data.
> Also, currently we can completely reconstruct the KV from the "big
> files". It would not be possible anymore.
>
>
> 2) performance
> On our clusters we see about 40% of physical disk IO being caused by
> readdir().
> We want to serve directory listing requests from memory. So "small"
> means "the KV can fit in the page cache".
> We estimate that we need the size per object to be below 50 bytes, which
> doesn't leave much room for data.
>
> LevelDB causes write amplification, as it will regularly copy data to
> different files (levels) to keep keys compressed and in sorted order. If
> we store object data within the KV, it will be copied around multiple
> times as well.
>
>
> Finally, it is also simpler to have only one path to handle. Beyond
> these issues, it would not be difficult to store data in the KV. This is
> something we can revisit after more tests and maybe some production
> experience.
>
Really great explanation. Thanks for sharing. I hope we can all learn
from the thorough approach you've taken to this problem. Good luck!

> >
> > Also the "writing your own filesystem" option in experiments seemed
> > more like a thing to do if you left the k/v stores out entirely.
>
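
For readers who want something concrete, here is a rough sketch of the write path described under "1) consistency". It is not the LOSF implementation: the record layout (HEADER), the function names, and the plain dict standing in for LevelDB are all assumptions made for illustration, and fallocate() preallocation is left out. It tries to show the two ideas from the email: an object counts as durable once the append to the "big file" has been fsync()ed, and the index can be rebuilt from the volume if asynchronous KV updates were lost.

```python
import os
import struct
import tempfile

# Hypothetical record header (not the real on-disk format):
# key length and data length, both little-endian uint32.
HEADER = struct.Struct("<II")


def append_object(volume, name, data):
    """Append one record to the volume and fsync before returning its offset."""
    key = name.encode("utf-8")
    volume.seek(0, os.SEEK_END)
    offset = volume.tell()
    volume.write(HEADER.pack(len(key), len(data)) + key + data)
    volume.flush()             # push Python's buffer to the OS
    os.fsync(volume.fileno())  # ask the OS to make the data stable on disk
    return offset


def read_object(volume, offset):
    """Return the data of the record that starts at the given offset."""
    volume.seek(offset)
    key_len, data_len = HEADER.unpack(volume.read(HEADER.size))
    volume.seek(key_len, os.SEEK_CUR)  # skip over the stored name
    return volume.read(data_len)


def rebuild_index(volume):
    """Rebuild the name -> offset mapping by scanning the volume.

    Simplification: this rescans the whole file, whereas the email
    describes scanning only the end of each volume after a reboot.
    """
    index = {}
    volume.seek(0)
    while True:
        offset = volume.tell()
        header = volume.read(HEADER.size)
        if len(header) < HEADER.size:
            break  # reached the end of the volume
        key_len, data_len = HEADER.unpack(header)
        body = volume.read(key_len + data_len)
        if len(body) < key_len + data_len:
            break  # truncated trailing record, ignore it
        index[body[:key_len].decode("utf-8")] = offset
    return index


if __name__ == "__main__":
    with tempfile.TemporaryFile() as volume:
        # The dict stands in for the KV; it is updated after the append.
        index = {"a": append_object(volume, "a", b"hello")}
        # Simulate a power loss: "b" is on disk, its index entry is not.
        append_object(volume, "b", b"world")
        assert "b" not in index
        index = rebuild_index(volume)
        print(read_object(volume, index["b"]))
```

Because every record carries its own name and length, the index holds nothing that the volume does not also hold. That is the property the email says would be lost if object data lived only in the KV.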
