Hi Chris,

S3 doesn't support append calls. AWS promotes multipart uploads for uploading large files in parallel, or when network reliability is an issue. Writing to S3 like a stream does not appear to be the intended purpose of multipart uploads.
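For context, the multipart flow looks roughly like this (a minimal Python sketch; `upload_part` here is a stand-in for the real S3 UploadPart request, and the 5 MB figure is S3's documented minimum part size for all but the last part). The point is that each part is an independent request over a slice of a payload you already have in hand, which is why it suits parallel uploads rather than streaming appends:

```python
from concurrent.futures import ThreadPoolExecutor

PART_SIZE = 5 * 1024 * 1024  # S3's minimum part size (except the last part)

def upload_part(upload_id, part_number, data):
    # Stand-in for the real UploadPart request: each part is an
    # independent, fully signed HTTP request carrying its own payload.
    return {"PartNumber": part_number, "ETag": f"etag-{part_number}"}

def multipart_upload(upload_id, payload):
    # Split the payload into fixed-size parts and upload them in parallel.
    parts = [payload[i:i + PART_SIZE] for i in range(0, len(payload), PART_SIZE)]
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(upload_part, upload_id, n + 1, part)
                   for n, part in enumerate(parts)]
        return [f.result() for f in futures]

etags = multipart_upload("demo-upload-id", b"x" * (12 * 1024 * 1024))
print(len(etags))  # a 12 MB payload splits into 3 parts
```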
I looked into what the AWS SDK does (in Java). It buffers the entire file in memory, however large it might be, and then uploads. I imagine this involves reallocating and copying into progressively larger buffers. There are a few issues raised about this on the SDK repos, like this one <https://github.com/aws/aws-sdk-java/issues/474>. But this doesn't seem to be something the SDKs can do anything about. People seem to be writing to temporary files and then uploading.

Regards,
Rahul

On Tue, Mar 6, 2018 at 9:04 PM, Chris Olivier <cjolivie...@gmail.com> wrote:

> it seems strange that s3 would make such a major restriction. there's
> literally no way to incrementally write a file without knowing the size
> beforehand? some sort of separate append calls, maybe?
>
> On Tue, Mar 6, 2018 at 8:53 PM Rahul Huilgol <rahulhuil...@gmail.com>
> wrote:
>
> > Hi everyone,
> >
> > I have been looking at updating the authentication used by S3FileSystem
> > in dmlc-core. The current code uses Signature Version 2, which now works
> > only in the us-east-1 region. We need to update the authentication
> > scheme to use Signature Version 4 (SigV4).
> >
> > I've submitted a PR <https://github.com/dmlc/dmlc-core/pull/378> to
> > change this for reads. But I wanted to seek thoughts on what to do for
> > writes, as there is a potential problem.
> >
> > *How writes to S3 work currently:*
> > Whenever S3FileSystem's stream.write() is called, data is buffered. When
> > the buffer is full, a request is made to S3. Since this can happen
> > multiple times, the multipart upload feature is used. An upload id is
> > created when the stream is initialized, and this upload id is used until
> > the stream is closed. The default buffer size is 64MB.
> >
> > *Problem:*
> > The new SigV4 authentication scheme changes how multipart uploads work.
> > Such an upload now requires that we know the total size of the data to
> > be sent (the sum of the sizes of all parts) when we create the first
> > request itself.
> > We need to pass the total size of the payload as part of a header. This
> > is not possible given that we don't know all the write calls beforehand.
> > For example, a call to save a model's parameters makes 145 calls to the
> > stream's write.
> >
> > *Approach?*
> > Is it okay to buffer the data to a local file, and then upload this file
> > to S3 at the end?
> > What use cases do we have for writes to S3 generally? I believe we would
> > want to write params after training, or logs. These wouldn't be too
> > large or frequent, I imagine. What would you suggest?
> >
> > Appreciate your thoughts and suggestions.
> >
> > Thanks,
> > Rahul Huilgol
>

--
Rahul Huilgol