Hi Chris,

S3 doesn't support append calls. AWS promotes multipart uploads for
uploading large files in parallel, or when network reliability is an
issue. Writing to an object like a stream doesn't appear to be what
multipart uploads are meant for.
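
For reference, a rough sketch of the multipart upload call sequence,
using boto3 in Python just for brevity (the bucket, key and chunk
sizes are placeholders, not what dmlc-core does):

    import boto3

    s3 = boto3.client("s3")
    bucket, key = "my-bucket", "big-object"  # placeholders

    mpu = s3.create_multipart_upload(Bucket=bucket, Key=key)
    parts = []
    # every part except the last must be at least 5 MB
    for num, chunk in enumerate([b"x" * 5 * 1024 * 1024, b"tail"], start=1):
        resp = s3.upload_part(Bucket=bucket, Key=key, PartNumber=num,
                              UploadId=mpu["UploadId"], Body=chunk)
        parts.append({"PartNumber": num, "ETag": resp["ETag"]})

    s3.complete_multipart_upload(Bucket=bucket, Key=key,
                                 UploadId=mpu["UploadId"],
                                 MultipartUpload={"Parts": parts})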

I looked into what the AWS SDK (for Java) does. It buffers the entire
payload in memory, however large the file might be, and then uploads.
I imagine this involves repeatedly reallocating and copying into a
larger buffer. A few issues have been raised about this on the SDK
repos, e.g. <https://github.com/aws/aws-sdk-java/issues/474>, but it
doesn't seem to be something the SDKs can do anything about. People
appear to be writing to temporary files and then uploading those.
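
A minimal sketch of that temp-file approach, again with boto3 rather
than the dmlc-core C++, and with a made-up bucket and key:

    import tempfile
    import boto3

    s3 = boto3.client("s3")

    with tempfile.NamedTemporaryFile() as tmp:
        # every stream.write() lands in the local file instead of S3
        tmp.write(b"first chunk of serialized params")
        tmp.write(b"second chunk")
        tmp.flush()
        tmp.seek(0)
        # one upload at close time: the total size is known by now, so
        # the request (and its SigV4 signing) can be set up in one shot
        s3.put_object(Bucket="my-bucket", Key="model/params", Body=tmp)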


On Tue, Mar 6, 2018 at 9:04 PM, Chris Olivier <cjolivie...@gmail.com> wrote:

> it seems strange that s3 would make such a major restriction. there’s
> literally no way to incrementally write a file without knowing the size
> beforehand? some sort of separate append calls, maybe?
> On Tue, Mar 6, 2018 at 8:53 PM Rahul Huilgol <rahulhuil...@gmail.com>
> wrote:
> > Hi everyone,
> >
> > I have been looking at updating the authentication used by
> > S3FileSystem in dmlc-core. The current code uses Signature version 2,
> > which now works only in the region us-east-1. We need to update the
> > authentication scheme to use Signature version 4 (SIG4).
> >
> > I've submitted a PR <https://github.com/dmlc/dmlc-core/pull/378> to
> > change this for Reads. But I wanted to seek out thoughts on what to
> > do for Writes, as there is a potential problem.
> >
> > *How writes to S3 work currently:*
> > Whenever s3filesystem's stream.write() is called, data is buffered.
> > When the buffer is full, a request is made to S3. Since this can
> > happen multiple times, the multipart upload feature is used. An
> > upload id is created when the stream is initialized, and is used
> > till the stream is closed. The default buffer size is 64MB.
> >
> > *Problem:*
> > The new SIG4 authentication scheme changes how multipart uploads
> > work. Such an upload now requires that we know the total size of the
> > data to be sent (the sum of the sizes of all parts) when we create
> > the first request itself. We need to pass the total payload size as
> > part of the header. This is not possible given that we don't know
> > all the write calls beforehand. For example, a call to save a
> > model's parameters makes 145 calls to the stream's write.
> >
> > *Approach?*
> > Is it okay to buffer it to a local file, and then upload this file
> > to S3 at the end?
> > What use case do we have for writes to S3 generally? I believe we
> > would want to write params or logs after training. These wouldn't be
> > too large or frequent, I imagine. What would you suggest?
> >
> > Appreciate your thoughts and suggestions.
> >
> > Thanks,
> > Rahul Huilgol
> >

Rahul Huilgol
