It seems strange that S3 would impose such a major restriction. Is there really no way to incrementally write a file without knowing its size beforehand? Some sort of separate append call, maybe?
On Tue, Mar 6, 2018 at 8:53 PM Rahul Huilgol <rahulhuil...@gmail.com> wrote:

> Hi everyone,
>
> I have been looking at updating the authentication used by S3FileSystem
> in dmlc-core. The current code uses Signature Version 2, which now works
> only in the us-east-1 region. We need to update the authentication scheme
> to use Signature Version 4 (SigV4).
>
> I've submitted a PR <https://github.com/dmlc/dmlc-core/pull/378> to change
> this for reads. But I wanted to seek out thoughts on what to do for
> writes, as there is a potential problem.
>
> *How writes to S3 work currently:*
> Whenever S3FileSystem's stream.write() is called, data is buffered. When
> the buffer is full, a request is made to S3. Since this can happen
> multiple times, the multipart upload feature is used. An upload ID is
> created when the stream is initialized, and this upload ID is used until
> the stream is closed. The default buffer size is 64MB.
>
> *Problem:*
> The new SigV4 authentication scheme changes how multipart uploads work.
> Such an upload now requires that we know the total size of the data to be
> sent (the sum of the sizes of all parts) when we create the first
> request: we need to pass the total payload size in a header. This is not
> possible, given that we don't know all the write calls beforehand. For
> example, a single call to save a model's parameters makes 145 calls to
> the stream's write.
>
> *Approach?*
> Is it okay to buffer the data to a local file, and then upload this file
> to S3 at the end?
> What use cases do we have for writes to S3 generally? I believe we would
> mostly want to write params after training, or logs. These wouldn't be
> too large or frequent, I imagine. What would you suggest?
>
> Appreciate your thoughts and suggestions.
>
> Thanks,
> Rahul Huilgol
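For reference, here is a minimal sketch of the current write path as I read the description above: buffer until 64MB, upload a part, repeat, then complete the multipart upload on close. InitMultipartUpload, UploadPart, and CompleteMultipartUpload are hypothetical stand-ins for the signed S3 REST calls, not the actual function names in dmlc-core.

#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-ins for the signed S3 REST requests; the real code
// builds and signs these HTTP calls itself.
std::string InitMultipartUpload(const std::string& path) { return "upload-id"; }
std::string UploadPart(const std::string& path, const std::string& upload_id,
                       int part_number, const std::vector<char>& data) {
  return "etag";
}
void CompleteMultipartUpload(const std::string& path,
                             const std::string& upload_id,
                             const std::vector<std::string>& etags) {}

class S3WriteStream {
 public:
  explicit S3WriteStream(const std::string& path)
      : path_(path), upload_id_(InitMultipartUpload(path)) {}

  // Buffer incoming writes; flush a part whenever the buffer fills.
  void Write(const void* data, size_t size) {
    const char* p = static_cast<const char*>(data);
    buffer_.insert(buffer_.end(), p, p + size);
    if (buffer_.size() >= kBufferSize) FlushPart();
  }

  // Flush the final (possibly short) part and finish the upload.
  void Close() {
    if (!buffer_.empty()) FlushPart();
    CompleteMultipartUpload(path_, upload_id_, etags_);
  }

 private:
  static const size_t kBufferSize = 64 << 20;  // 64MB default buffer

  void FlushPart() {
    etags_.push_back(UploadPart(path_, upload_id_, ++part_number_, buffer_));
    buffer_.clear();
  }

  std::string path_;
  std::string upload_id_;  // created once, reused until Close()
  std::vector<char> buffer_;
  std::vector<std::string> etags_;
  int part_number_ = 0;
};

The problem described in the email is that under SigV4 the very first of those requests would need the total payload size, which Write() cannot know while calls are still coming in.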
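If buffering to a local file is acceptable, the stream could look roughly like the sketch below: writes go to a temp file, and only at Close() is anything sent to S3, at which point the total size is known and the size header SigV4 requires can be filled in. PutObjectFromFile is a hypothetical helper for the final signed upload; a multipart upload with the now-known total size would work the same way.

#include <cstddef>
#include <cstdio>
#include <fstream>
#include <string>

// Hypothetical stand-in for a single SigV4-signed PUT whose total size
// (and signed payload) is known because the whole object is on disk.
void PutObjectFromFile(const std::string& s3_path,
                       const std::string& local_path, size_t size) {}

class S3BufferedWriteStream {
 public:
  explicit S3BufferedWriteStream(const std::string& s3_path)
      : s3_path_(s3_path),
        tmp_path_(std::tmpnam(nullptr)),   // illustration only; real code
        out_(tmp_path_, std::ios::binary)  // should create temp files safely
  {}

  // Writes only touch the local disk, so they can be arbitrarily many.
  void Write(const void* data, size_t size) {
    out_.write(static_cast<const char*>(data),
               static_cast<std::streamsize>(size));
  }

  // At Close() the total size is finally known, so the SigV4 request can
  // carry it before anything is sent to S3.
  void Close() {
    out_.close();
    std::ifstream in(tmp_path_, std::ios::binary | std::ios::ate);
    size_t total_size = static_cast<size_t>(in.tellg());
    PutObjectFromFile(s3_path_, tmp_path_, total_size);
    std::remove(tmp_path_.c_str());  // clean up the local buffer file
  }

 private:
  std::string s3_path_;
  std::string tmp_path_;
  std::ofstream out_;
};

The obvious costs are local disk usage and an upload that starts only after training has finished writing, which seems tolerable for the params-and-logs use case Rahul describes.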