Thanks for the response, Dan. While writing this prototype, almost all of your questions came to my mind as well.
> In general, it's unfortunate that -- as the code is set up right now --
> it made sense for you to make an "s3 sink" vs a "s3 channel
> interface".

I would have liked to use the IOChannel and IOUtils as well, but currently they heavily favour GCS and there isn't a clear path to break that dependency (at least for someone as unfamiliar with that code as me). It actually had me questioning whether Sink is even the right way to do this. Room for improvement and clarification in the SDK.

> *) Another thing we'd like to do better in Beam is handling of credentials.
> Suppose I want to read two different files from S3, using two different
> sets of credentials. The "read credentials from classpath" approach here
> probably won't work. This is another ripe area for some design as we move
> forward with Beam.

Agreed. The example that tests this uses the DirectPipelineRunner as a kludge for reading the credentials from your local machine. You could potentially pass credentials in at run-time as PipelineOptions as well (see the sketch at the end of this mail). A lot of this depends on individual use cases, so something more flexible would be appreciated.

> *) I think you acknowledged this in the post, but: I'm a little concerned
> about data loss here -- it looks like if the S3 copy fails, we do not fail
> the bundle?

In this prototype implementation, data loss is a possibility.

> *) It looks like you write the entire contents of the file locally, then
> copy to S3. Is there a reason not to write directly to a channel that
> writes to an S3 file?

The Sink API seems to assume a channel interface is available, yet I haven't been able to find a way to create such an interface using the AWS Java SDK. I'm sure it is possible; I just don't know how. One idea is sketched at the end of this mail.

> *) In general, we'd like a better failure handling story in Beam.

Agreed. This needs work, and the current implementation is nowhere near production ready.

> Selfishly, I'd love it if we could use you as an early review guinea
> pig for Beam contributions :).

I would definitely like to be part of that! We are going to be using Dataflow/Beam for a few use cases, and I would like to expand that within the organization as the new SDK improves. I would also be interested in helping with some of the development of these ideas, but would need some guidance from people more familiar with the project.

Hopefully we can touch base again on these issues in the coming months,
Kevin
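P.S. To make the run-time credentials idea above concrete, here is a rough sketch of what it might look like against the Dataflow SDK we're on today. The S3Options interface and option names are mine, purely illustrative; nothing like this exists in the SDK:

import com.amazonaws.auth.BasicAWSCredentials;
import com.google.cloud.dataflow.sdk.options.Description;
import com.google.cloud.dataflow.sdk.options.PipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;

// Hypothetical options interface for passing AWS credentials at run-time.
public interface S3Options extends PipelineOptions {
  @Description("AWS access key id for the S3 sink")
  String getAwsAccessKeyId();
  void setAwsAccessKeyId(String value);

  @Description("AWS secret access key for the S3 sink")
  String getAwsSecretKey();
  void setAwsSecretKey(String value);
}

// At pipeline construction:
//   S3Options options = PipelineOptionsFactory.fromArgs(args).as(S3Options.class);
//   BasicAWSCredentials credentials = new BasicAWSCredentials(
//       options.getAwsAccessKeyId(), options.getAwsSecretKey());

Note this still only carries one set of credentials per pipeline, so it doesn't answer your two-files-two-credentials scenario.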

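P.P.S. On writing directly to a channel: the closest thing I can see in the AWS Java SDK (v1) is the multipart upload API, which could plausibly be adapted to a WritableByteChannel along these lines. This is an untested sketch (the class name is mine; retries and real error handling are omitted), but note that a failure in close() would propagate, which is also how we could fail the bundle instead of dropping data silently:

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.*;
import java.io.ByteArrayInputStream;
import java.nio.ByteBuffer;
import java.nio.channels.WritableByteChannel;
import java.util.ArrayList;
import java.util.List;

// Hypothetical adapter: exposes an S3 multipart upload as a WritableByteChannel.
public class S3WritableChannel implements WritableByteChannel {
  private static final int PART_SIZE = 5 * 1024 * 1024; // S3's minimum part size

  private final AmazonS3 s3;
  private final String bucket;
  private final String key;
  private final String uploadId;
  private final List<PartETag> etags = new ArrayList<>();
  private final ByteBuffer buffer = ByteBuffer.allocate(PART_SIZE);
  private int partNumber = 1;
  private boolean open = true;

  public S3WritableChannel(AmazonS3 s3, String bucket, String key) {
    this.s3 = s3;
    this.bucket = bucket;
    this.key = key;
    this.uploadId = s3.initiateMultipartUpload(
        new InitiateMultipartUploadRequest(bucket, key)).getUploadId();
  }

  @Override
  public int write(ByteBuffer src) {
    int written = 0;
    while (src.hasRemaining()) {
      if (!buffer.hasRemaining()) {
        flushPart();
      }
      // Copy as many bytes as fit into the part buffer.
      int n = Math.min(src.remaining(), buffer.remaining());
      ByteBuffer slice = src.duplicate();
      slice.limit(slice.position() + n);
      buffer.put(slice);
      src.position(src.position() + n);
      written += n;
    }
    return written;
  }

  private void flushPart() {
    buffer.flip();
    byte[] bytes = new byte[buffer.remaining()];
    buffer.get(bytes);
    UploadPartResult result = s3.uploadPart(new UploadPartRequest()
        .withBucketName(bucket).withKey(key).withUploadId(uploadId)
        .withPartNumber(partNumber++)
        .withInputStream(new ByteArrayInputStream(bytes))
        .withPartSize(bytes.length));
    etags.add(result.getPartETag());
    buffer.clear();
  }

  @Override
  public boolean isOpen() { return open; }

  @Override
  public void close() {
    if (!open) return;
    open = false;
    try {
      if (buffer.position() > 0) {
        flushPart(); // the final part may be under 5 MB
      }
      // (A zero-byte upload would need special-casing.)
      s3.completeMultipartUpload(new CompleteMultipartUploadRequest(
          bucket, key, uploadId, etags));
    } catch (RuntimeException e) {
      // Clean up the partial upload and let the failure surface to the runner,
      // so the bundle fails rather than the data being lost silently.
      s3.abortMultipartUpload(new AbortMultipartUploadRequest(bucket, key, uploadId));
      throw e;
    }
  }
}

If something like this holds up, the Sink could hand this channel to its writer and skip the local file entirely.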