Hi Kevin,

Thanks -- this is a nice post! I'm glad you were able to produce an S3 sink that works for you.
I've been thinking about I/O for Dataflow, and thinking ahead to Beam, for a while now. So here are some thoughts on your code and, mixed inline, a few design/feature requests for Beam as a whole that we'll file in JIRA soon.

*) It looks like the S3 sink is roughly semantically equivalent to TextIO.Write -- it writes strings to a file. Ideally, we could decouple the destination file system (local disk, s3://, gs://, write to a socket) from the format of the files being written. Suppose we had an S3 IOChannelFactory <https://github.com/GoogleCloudPlatform/DataflowJavaSDK/blob/master/sdk/src/main/java/com/google/cloud/dataflow/sdk/util/IOChannelFactory.java>, and TextIO.Write could use it -- would that have been enough? In general, it's unfortunate that, as the code is set up right now, it made more sense for you to build an "S3 sink" than an "S3 channel interface". Improving IOChannelFactory's abstractions and making it generic rather than Google Cloud Storage-specific <https://github.com/GoogleCloudPlatform/DataflowJavaSDK/blob/master/sdk/src/main/java/com/google/cloud/dataflow/sdk/util/GcsIOChannelFactory.java> is one of our early Beam tasks.

*) Another thing we'd like to handle better in Beam is credentials. Suppose I want to read two different files from S3 using two different sets of credentials. The "read credentials from the classpath" approach here probably won't work. This is another ripe area for design as we move forward with Beam. In Dataflow, we essentially punted on this issue by saying "well, the entire job needs one set of credentials" -- but it's something we've known we need to fix going forward with Beam.

*) I think you acknowledged this in the post, but I'm a little concerned about data loss here -- it looks like if the S3 copy fails, we do not fail the bundle? We'd want to wrap the S3 put in a few retries, but we'd also want the bundle to fail and be retried if we don't eventually succeed in writing to S3.
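To make that last point concrete, here's a minimal sketch of the kind of retry wrapper I mean -- the class and parameter names are made up, and the actual S3 put would go inside the supplier. The important property is that after the retries are exhausted, the exception propagates, so the bundle fails and the runner can retry it instead of silently dropping data:

```java
import java.util.function.Supplier;

// Sketch of a retry wrapper for the S3 put (names hypothetical).
// If every attempt fails, the last exception is rethrown so the
// bundle fails and gets retried, rather than losing the write.
public class RetryingPut {
    public static <T> T withRetries(Supplier<T> action, int maxAttempts, long baseBackoffMillis) {
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.get();
            } catch (RuntimeException e) {
                last = e;
                if (attempt < maxAttempts) {
                    try {
                        // Exponential backoff between attempts.
                        Thread.sleep(baseBackoffMillis << (attempt - 1));
                    } catch (InterruptedException ie) {
                        Thread.currentThread().interrupt();
                        throw last;
                    }
                }
            }
        }
        // Rethrow so the bundle fails and is retried by the service.
        throw last;
    }
}
```

The S3 call itself would be something like `withRetries(() -> s3Client.putObject(...), 4, 1000)` -- whatever the real client API looks like.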
Avoiding data loss is a primary requirement for us.

[* Caveat: I'm not an S3 expert ]

*) It looks like you write the entire contents of the file locally, then copy it to S3. Is there a reason not to write directly to a channel that streams to an S3 file? It seems like this would make the sink more scalable (S3 can hold bigger files than a hard drive, probably) and potentially faster (a buffered write channel vs. a file write plus a network copy).

*) In general, we'd like a better failure-handling story in Beam. As an example, suppose you changed the S3 sink as I suggested -- retry the S3 copy, and fail the bundle if it doesn't eventually succeed. The job will retry the bundle (in Cloud Dataflow, we retry up to 4 times), but it might still fail if the bundle fails persistently (say, the temp file it creates is too large for the local drive). In that case, we may have written some files to S3 successfully but still have a job failure. I think we'd then want to be able to remove all the files we successfully wrote to S3 -- otherwise users will get angry at us for racking up storage fees. In Dataflow, for TextIO.Write, we can "cheat" by using the Cloud Dataflow service to do that cleanup outside of Java-land. We clearly need to add some way to do this cleanup to Beam.

As for code review -- if you want to submit this as a contrib module to the Dataflow Java SDK, we'd be happy to take a look. (I've given you some of the initial feedback in this mail.) Alternately, the refactoring and seeding of the Beam SDK is coming soon. Selfishly, I'd love it if we could use you as an early review guinea pig for Beam contributions :).

Thanks!

Dan

On Thu, Feb 11, 2016 at 2:14 PM, Kevin Sookocheff <[email protected]> wrote:

> Hi everyone,
>
> I've been interested in the project and have been doing small contributions,
> mostly around IO connectors for Dataflow.
>
> I recently wrote up an article on batch writing to S3 on my blog (pasted
> below).
>
> Would it be best to wait until the refactor and seeding of the Beam
> project to work this code through proper review?
>
> Also, any feedback and/or corrections on the article are welcome.
>
> Article Link: http://sookocheff.com/post/beam/writing-a-beam-sink/
>
> Thanks,
>
> Kevin
