Hi, I have done some SQLite footgun elimination at Mozilla and was curious whether Swift ran into similar issues. From blog posts like http://blog.maginatics.com/2014/05/13/multi-container-sharding-making-openstack-swift-swifter/ and http://engineering.spilgames.com/openstack-swift-lots-small-files/ it seemed worth looking into.
*Good things*

* torgomatic pointed out on IRC that inserts are now batched via an intermediate file that isn't fsync()ed (https://github.com/openstack/swift/commit/85362fdf4e7e70765ba08cee288437a763ea5475). That should help with the use cases described in the blog posts above. I hope the rest of my observations are still of some use.
* There are few indexes involved. This is good, because indexes in single-file databases are very risky for performance.

I set up devstack on my laptop to observe Swift performance and poke at the resulting db. I don't have a proper benchmarking environment to check whether any of my observations are valid.

*Container .db handle LRU*

It seems that container DBs are opened once per read/write operation: having container-server keep an LRU list of db handles might help workloads with hot containers.

*Speeding up LIST*

* The lack of an index for LIST is good, but means LIST will effectively read the whole file.
* A 1024-byte page size is used; moving to bigger page sizes reduces the number of syscalls.
** Firefox moving from 1K to 32K pages cut our DB IO by 1.2-2x: http://taras.glek.net/blog/2013/06/28/new-performance-people/
* Doing fadvise(WILL_NEED) on the db file prior to opening it with SQLite should help the OS read the db file in at maximum throughput. This causes Linux to issue disk IO in 2MB chunks vs 128K with default readahead settings. SQLite should really do this itself :(
* Appends end up fragmenting the db file. Swift should use SQLITE_FCNTL_CHUNK_SIZE (http://www.sqlite.org/c3ref/c_fcntl_chunk_size.html#sqlitefcntlchunksize) to grow the DB with less fragmentation, OR copy (with fallocate) the SQLite file over every time it doubles in size (e.g. during weekly compaction).
** Fragmentation means db scans are non-sequential on disk.
** XFS is particularly susceptible to fragmentation.
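To make the fadvise and page-size suggestions concrete, here's a rough sketch of what an open helper could look like. The function name, the 32K default, and the overall shape are just my illustration, not Swift's actual API; `os.posix_fadvise` is Linux/Unix-only (Python 3.3+).

```python
import os
import sqlite3

def open_container_db(path, page_size=32768):
    """Open a container DB with readahead + page-size hints.

    Illustrative sketch only -- not Swift's real open path.
    """
    # Hint the kernel to prefetch the whole file at full throughput;
    # on Linux this issues readahead in larger chunks than the 128K default.
    fd = os.open(path, os.O_RDONLY)
    try:
        length = os.fstat(fd).st_size  # 0 means "to end of file" for fadvise
        os.posix_fadvise(fd, 0, length, os.POSIX_FADV_WILLNEED)
    finally:
        os.close(fd)

    conn = sqlite3.connect(path)
    # page_size only takes effect on an empty database (or after VACUUM),
    # so existing 1K-page DBs would need a VACUUM to migrate.
    conn.execute("PRAGMA page_size = %d" % page_size)
    return conn
```

Note that SQLITE_FCNTL_CHUNK_SIZE isn't reachable from Python's stdlib sqlite3 module, so the chunk-size fix would need to happen at the C level or via a third-party binding.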
** You can use filefrag on .db files to monitor fragmentation.

*Write amplification*

* Write amplification is bad because it causes table scans to be slower than necessary (reading less data is always better for cache locality; torgomatic says container dbs can get into the gigabytes).
* Swift uses timestamps in decimal-seconds form, e.g. 1409350185.26144, stored as a string. I'm guessing these are mainly used for HTTP headers, yet HTTP uses whole seconds, which would normally only take up 4 bytes.
* CREATE INDEX ix_object_deleted_name ON object (deleted, name) might be a problem for delete-heavy workloads.
** SQLite copies column entries used in indexes. Here the index almost doubles the amount of space used by deleted entries.
** Indexes in general are risky in SQLite, as they end up interspersed with table data until a VACUUM. This causes table-scan operations (e.g. during LIST) to be suboptimal. It could also mean that operations relying on the index are no better IO-wise than a whole-table scan.
* "deleted" appears in both the content type and the deleted field. This might not be a big deal.
* Ideally you'd be using a database that can be (lz4?) compressed at the whole-file level. I'm not aware of a good off-the-shelf solution here. Some column store might be a decent replacement for SQLite.

Hope some of these observations are useful. If not, sorry for the noise. I'm pretty impressed at swift-container's minimalist SQLite usage; I did not see many footguns here.

Taras
_______________________________________________
OpenStack-dev mailing list
[email protected]
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
