I have been looking at updating the authentication used by S3FileSystem in
dmlc-core. The current code uses Signature Version 2, which now works only
in the us-east-1 region. We need to update the authentication scheme to use
Signature Version 4 (SIG4).
I've submitted a PR <https://github.com/dmlc/dmlc-core/pull/378> to change
this for reads, but I wanted to ask for thoughts on what to do for writes,
as there is a potential problem.
*How writes to S3 work currently:*
Whenever S3FileSystem's stream.write() is called, the data is buffered. When
the buffer is full, a request is made to S3. Since this can happen multiple
times, the multipart upload feature is used. An upload ID is created when
the stream is initialized and is used until the stream is closed.
The default buffer size is 64MB.
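
To make the buffering behaviour described above concrete, here is a minimal
sketch of it in C++. It is not the dmlc-core code: the class name
BufferedPartWriter, the upload_part callback (which stands in for the S3
"upload part" request) and parts_sent() are all hypothetical names chosen for
illustration, and creating and completing the upload ID are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>

// Hypothetical sketch of a buffered multipart writer; not the dmlc-core code.
class BufferedPartWriter {
 public:
  // upload_part(part_number, data) stands in for the S3 "upload part" request.
  using UploadPart = std::function<void(int, const std::string&)>;

  BufferedPartWriter(std::size_t max_buffer_size, UploadPart upload_part)
      : max_buffer_size_(max_buffer_size),
        upload_part_(std::move(upload_part)) {
    assert(max_buffer_size_ > 0);  // a zero-sized buffer could never fill
  }

  // Append data to the buffer and send one part each time the buffer fills.
  void Write(const char* data, std::size_t size) {
    while (size > 0) {
      const std::size_t room = max_buffer_size_ - buffer_.size();
      const std::size_t n = size < room ? size : room;
      buffer_.append(data, n);
      data += n;
      size -= n;
      if (buffer_.size() == max_buffer_size_) Flush();
    }
  }

  // Send whatever is still buffered. A real stream would also complete the
  // multipart upload here, using the upload ID created at initialization.
  void Close() {
    if (!buffer_.empty()) Flush();
  }

  // Number of parts handed to upload_part so far.
  int parts_sent() const { return next_part_ - 1; }

 private:
  void Flush() {
    upload_part_(next_part_++, buffer_);
    buffer_.clear();
  }

  std::size_t max_buffer_size_;
  UploadPart upload_part_;
  std::string buffer_;
  int next_part_ = 1;  // S3 part numbers start at 1
};
```

The point of the sketch is that the writer never knows the total size in
advance: each part is sent as soon as the buffer fills, and the last, shorter
part is only known when the stream is closed.
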
The new SIG4 authentication scheme changes how multipart uploads work. Such
an upload now requires that we know the total size of the data to be sent
(the sum of the sizes of all parts) when we create the very first request,
because the total payload size must be passed in a header. This is not
possible, given that we don't know all the write calls beforehand. For
example, a single call to save a model's parameters makes 145 calls to the
stream's write.
Is it okay to buffer the data to a local file, and then upload this file to
S3 at the end, when the stream is closed?
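
For discussion, the local-file idea could look roughly like the sketch below.
It is only an illustration under stated assumptions: LocalSpoolWriter, the
upload_file callback (standing in for the signed S3 upload) and the
caller-supplied local_path are hypothetical names, not existing dmlc-core
APIs. Every write goes to a local file, and the upload happens once on close,
when the total size is known.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <fstream>
#include <functional>
#include <string>
#include <utility>

// Hypothetical sketch: spool all writes to a local file, upload once on close.
class LocalSpoolWriter {
 public:
  // upload_file(path, total_size) stands in for the actual S3 upload, which
  // can now put the total payload size in the request headers.
  using UploadFile = std::function<void(const std::string&, std::size_t)>;

  LocalSpoolWriter(std::string local_path, UploadFile upload_file)
      : local_path_(std::move(local_path)),
        upload_file_(std::move(upload_file)),
        out_(local_path_, std::ios::binary | std::ios::trunc) {
    assert(out_.is_open());
  }

  // Each write only touches the local file; nothing is sent to S3 yet.
  void Write(const char* data, std::size_t size) {
    out_.write(data, static_cast<std::streamsize>(size));
    total_bytes_ += size;
  }

  // On close the total size is known, so the upload can be started.
  void Close() {
    if (closed_) return;
    out_.close();
    upload_file_(local_path_, total_bytes_);
    std::remove(local_path_.c_str());  // clean up the temporary file
    closed_ = true;
  }

 private:
  // Declaration order matters: out_ is opened using local_path_.
  std::string local_path_;
  UploadFile upload_file_;
  std::ofstream out_;
  std::size_t total_bytes_ = 0;
  bool closed_ = false;
};
```

The trade-off is extra local disk usage and the whole upload being deferred
to close, in exchange for knowing the payload size up front.
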
What use cases do we have for writes to S3 in general? I believe we would
want to write params after training, or logs. I imagine these wouldn't be
too large or too frequent. What would you suggest?
Appreciate your thoughts and suggestions.