Hi Kevin,

Thanks -- this is a nice post! I'm glad you were able to produce an S3 sink that works for you.
I've been thinking about I/O for Dataflow, and thinking ahead to Beam, for a while now. So here are some thoughts on your code and, mixed inline, a few design/feature requests for Beam as a whole that we'll file in JIRA soon.

*) It looks like the S3 sink is roughly semantically equivalent to TextIO.Write -- it writes strings to a file. Ideally, we could decouple the destination file system (local disk, s3://, gs://, write to a socket) from the format of the files being written. Suppose we had an S3 IOChannelFactory <https://github.com/GoogleCloudPlatform/DataflowJavaSDK/blob/master/sdk/src/main/java/com/google/cloud/dataflow/sdk/util/IOChannelFactory.java>, and TextIO.Write could use it -- would that have been enough? In general, it's unfortunate that, as the code is set up right now, it made more sense for you to build an "S3 sink" than an "S3 channel interface". Improving IOChannelFactory's abstractions and making it generic rather than Google Cloud Storage-specific <https://github.com/GoogleCloudPlatform/DataflowJavaSDK/blob/master/sdk/src/main/java/com/google/cloud/dataflow/sdk/util/GcsIOChannelFactory.java> is one of our early Beam tasks.

*) Another thing we'd like to handle better in Beam is credentials. Suppose I want to read two different files from S3 using two different sets of credentials. The "read credentials from the classpath" approach here probably won't work. This is another ripe area for design as we move forward with Beam. In Dataflow, we essentially punted on this issue by saying "well, the entire job needs one set of credentials" -- but it's something we've known we need to fix going forward with Beam.

*) I think you acknowledged this in the post, but I'm a little concerned about data loss here -- it looks like if the S3 copy fails, we do not fail the bundle? We'd want to wrap the S3 put in a few retries, but we'd also want the bundle to fail and be retried if we don't eventually succeed in writing to S3.
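To make that last point concrete, here's a minimal sketch of the kind of retry wrapper I mean -- the class and parameter names are made up, and the actual S3 put would go inside the supplier. The important property is that after the retries are exhausted, the exception propagates, so the bundle fails and the runner can retry it instead of silently dropping data:

```java
import java.util.function.Supplier;

// Sketch of a retry wrapper for the S3 put (names hypothetical).
// If every attempt fails, the last exception is rethrown so the
// bundle fails and gets retried, rather than losing the write.
public class RetryingPut {
    public static <T> T withRetries(Supplier<T> action, int maxAttempts, long baseBackoffMillis) {
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.get();
            } catch (RuntimeException e) {
                last = e;
                if (attempt < maxAttempts) {
                    try {
                        // Exponential backoff between attempts.
                        Thread.sleep(baseBackoffMillis << (attempt - 1));
                    } catch (InterruptedException ie) {
                        Thread.currentThread().interrupt();
                        throw last;
                    }
                }
            }
        }
        // Rethrow so the bundle fails and is retried by the service.
        throw last;
    }
}
```

The S3 call itself would be something like `withRetries(() -> s3Client.putObject(...), 4, 1000)` -- whatever the real client API looks like.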
Avoiding data loss is a primary requirement for us.

[* Caveat: I'm not an S3 expert ]

*) It looks like you write the entire contents of the file locally, then copy it to S3. Is there a reason not to write directly to a channel that streams to an S3 file? It seems like this would make the sink more scalable (S3 can hold bigger files than a hard drive, probably) and potentially faster (a buffered write channel vs. a file write plus a network copy).

*) In general, we'd like a better failure-handling story in Beam. As an example, suppose you changed the S3 sink as I suggested -- retry the S3 copy, and fail the bundle if it doesn't eventually succeed. The job will retry the bundle (in Cloud Dataflow, we retry up to 4 times), but it might still fail if the bundle fails persistently (say, the temp file it creates is too large for the local drive). In that case, we may have written some files to S3 successfully but still have a job failure. I think we'd then want to be able to remove all the files we successfully wrote to S3 -- otherwise users will get angry at us for racking up storage fees. In Dataflow, for TextIO.Write, we can "cheat" by using the Cloud Dataflow service to do that cleanup outside of Java-land. We clearly need to add some way to do this cleanup to Beam.

As for code review -- if you want to submit this as a contrib module to the Dataflow Java SDK, we'd be happy to take a look. (I've given you some of the initial feedback in this mail.) Alternately, the refactoring and seeding of the Beam SDK is coming soon. Selfishly, I'd love it if we could use you as an early review guinea pig for Beam contributions :).

Thanks!

Dan

On Thu, Feb 11, 2016 at 2:14 PM, Kevin Sookocheff <[email protected]> wrote:

> Hi everyone,
>
> I've been interested in the project and have been doing small contributions,
> mostly around IO connectors for Dataflow.
>
> I recently wrote up an article on batch writing to S3 on my blog (pasted
> below).
>
> Would it be best to wait until the refactor and seeding of the Beam
> project to work this code through proper review?
>
> Also, any feedback and/or corrections on the article are welcome.
>
> Article Link: http://sookocheff.com/post/beam/writing-a-beam-sink/
>
> Thanks,
>
> Kevin
