I have several reasons for not wanting that functionality in Swift; I'll
try to enumerate them here.
First, this represents a significant scope expansion. As of today, Swift
deals only in opaque byte streams, and any operation applies equally to
all byte streams. This change would, for the first time, add knowledge
of _what the bytes mean_ into Swift.
Now, it may be possible to justify this one expansion for this one use
case, but that opens the door to all sorts of crazy stuff. This patch
would add support for zip archives; what about ar or tar or rar? Deb or
RPM files? In five years' time, will Swift know how to extract the EXIF
metadata from a JPEG file, or pull an individual worksheet out of an
Excel file as CSV? Once Swift starts knowing what the bytes mean, it
becomes much harder to keep the scope from growing and growing.
Second, this adds another difference between normal objects and large
(multi-segment) objects. With this patch, if I upload a zip file as one
object, I can extract individual files from the archive. If I upload it
in several segments and use a large-object manifest to tie them
together, I cannot extract individual files from the archive. Generally
speaking, normal and large objects should work as identically as
possible. Behavior differences that can be resolved should be resolved,
and I don't want to add any more.
Third, this patch assumes that the file on disk has the same contents as
the object that was stored. Today, of course, Swift uses only replicated
storage so that assumption holds true. However, there's work in progress
for erasure-coded storage of objects, in which case no single object
server would have the whole zip archive on its local filesystem. Much
like I don't want differences between normal and large objects, I also
don't want differences between normal objects that depend on how the
data is stored.
It could be an interesting idea, but it would have to be done as
proxy-server middleware for large-object/erasure-code compatibility, and
it would have to live outside the Swift source tree.
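To make the middleware suggestion concrete, here is a minimal sketch of what such an out-of-tree proxy filter could look like. The `?as=zip&list_content` query convention is taken from the patch proposal quoted below; the class name and everything else here is an assumption for illustration, not the patch's actual code.

```python
# Hypothetical sketch of an out-of-tree WSGI middleware that intercepts
# archive-read queries at the proxy, fetches the object's bytes through the
# normal GET path, and does the zip handling itself.  Only the query
# parameter names come from the proposed patch; the rest is assumed.
import io
import zipfile
from urllib.parse import parse_qs


class ArchiveReadMiddleware:
    def __init__(self, app):
        self.app = app  # the next WSGI app in the proxy pipeline

    def __call__(self, environ, start_response):
        qs = parse_qs(environ.get('QUERY_STRING', ''), keep_blank_values=True)
        if environ.get('REQUEST_METHOD') != 'GET' or qs.get('as') != ['zip']:
            # Not an archive-read request: behave exactly like plain Swift.
            return self.app(environ, start_response)

        # Fetch the whole object through the normal pipeline.  Because this
        # runs in the proxy, it works identically for replicated,
        # erasure-coded, and large objects -- the object servers never see
        # any zip logic.
        def capture(status, headers, exc_info=None):
            pass  # a real filter would propagate errors from the backend
        body = b''.join(self.app(environ, capture))

        zf = zipfile.ZipFile(io.BytesIO(body))
        if 'list_content' in qs:
            listing = '\n'.join(zf.namelist()).encode('utf-8')
            start_response('200 OK', [('Content-Type', 'text/plain')])
            return [listing]
        # get_content handling, multipart framing, size limits, etc.
        # would go here.
        start_response('400 Bad Request', [('Content-Type', 'text/plain')])
        return [b'unsupported archive operation\n']
```

Note that fetching the whole object into memory before unzipping is itself a simplification; a production filter would want ranged reads against the zip central directory.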
On 2/19/14 1:43 PM, Vyacheslav Rafalskiy wrote:
Hi all,
This is an attempt to activate the discussion of the following patch,
which introduces support for reading from archives:
https://review.openstack.org/#q,topic:bp/read-from-archives,n,z
Some comments are already reflected in the patch (thanks Christian
Schwede and Michael Barton), see also Discussion below.
Motivation
----------
Currently Swift is not optimal for storing billions of small files. This
is a consequence of the fact that every object in Swift is a file on the
underlying file system (not counting the replicas). Every file requires
its metadata to be loaded into memory before it can be processed.
Metadata is normally cached by the file system but when the total number
of files is too large and access is fairly random caching no longer
works and performance quickly degrades. Swift's container and tenant
catalogs, held in SQLite databases, don't offer stellar performance
either when the number of items in them goes into the millions.
An alternative for this use case could be a database such as HBase or
Cassandra, which know how to deal with BLOBs. Databases have their ways
to aggregate data in large files and then find it when necessary.
However, database-as-storage solutions have their own problems, one of
which is added complexity.
The above patch offers a way around Swift's limitation for one specific
but important use case:
1. one needs to store many small(ish) files, say 1-100KB, which, when
stored separately, cause performance degradation
2. these files don't change (too often) such as in data warehouse
3. a random access to the files is necessary
Solution
--------
The suggested solution is to aggregate the small files in archives, such
as zip or tar, of reasonable size. The archives can only be written as a
whole. They can, of course, be read as a whole with Swift's existing
GET command (pseudocode):
GET /tenant/container/big_file.zip
The patch modifies the behavior of the command if additional parameters
are present, for example:
GET /tenant/container/big_file.zip?as=zip&list_content
will result in a text/plain response with a list of the files in the zip
GET
/tenant/container/big_file.zip?as=zip&get_content=content_file1.png,content_file2.bin
will return a multipart response with the requested files as binary
attachments
The additional GET functionality must be activated in the config file;
otherwise there is no change in Swift's behavior.
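To illustrate what the two query forms would map to on the server side, here is a sketch using Python's stdlib zipfile module. The function names mirror the query parameters for readability; this is an illustration of the operations, not the patch's actual code.

```python
# Sketch of the server-side archive operations behind the two query forms,
# using Python's stdlib zipfile; the real patch may differ in details.
import io
import zipfile


def list_content(archive_bytes):
    """?as=zip&list_content -> newline-separated member names (text/plain)."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        return '\n'.join(zf.namelist())


def get_content(archive_bytes, names):
    """?as=zip&get_content=a,b -> the named members' bytes, which the
    server would then wrap into a multipart response."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        return {name: zf.read(name) for name in names}
```

Because the zip central directory sits at the end of the file, a real implementation could locate and read only the requested members rather than loading the whole archive.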
The total size of the attachments is limited to prevent a
decompression-bomb ("explosion") attack when extracting files.
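That limit can be enforced before any decompression happens, since the zip central directory records each member's uncompressed size. A sketch of the check, with an assumed example cap of 10 MB:

```python
# Sketch of the decompression-bomb guard: the zip central directory records
# each member's uncompressed size, so the check is essentially free.
# The 10 MB cap is an assumed example value, not one from the patch.
import io
import zipfile

MAX_TOTAL_BYTES = 10 * 1024 * 1024  # hypothetical configurable cap


def check_attachment_size(archive_bytes, names):
    """Return True if the requested members' declared uncompressed sizes
    fit under the cap, False otherwise."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        total = sum(zf.getinfo(name).file_size for name in names)
    return total <= MAX_TOTAL_BYTES
```

The declared sizes in the central directory can be forged, so a robust implementation would also cap the actual bytes read out while extracting.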
Discussion
----------
Some concerns were raised:
1. Decompression can put a significant additional load on the object
server
True.
To mitigate on the client side: store files in the archive uncompressed
rather than having the archive compress them; you can pre-compress them
yourself before storing.
If it is a concern to the service provider: do not activate the feature.
2. The response should be streamed rather than provided as a whole
I don't think so.
If you follow the use case, the total size of the archive should be
"reasonable", meaning not too small; still, if your archives are larger
than a couple of megabytes you are doing it wrong. A get_content request
would normally cover only a small portion of the archive, so no
streaming is necessary.
TODO
----
Tests
Thanks,
Vyacheslav
_______________________________________________
Mailing list: http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack
Post to : [email protected]
Unsubscribe : http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack