[ https://issues.apache.org/jira/browse/BEAM-2572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16078746#comment-16078746 ]
Dmitry Demeshchuk commented on BEAM-2572:
-----------------------------------------

The biggest blocker for me right now is making the source and the sink aware of the AWS credentials. Specifically, I'm seeing two possible user-friendly approaches: either adding some pipeline option(s) and making them accessible, or passing extra arguments to the source/sink.

{code}
pipeline_options = PipelineOptions(
    aws_access_key_id='bla',
    aws_secret_access_key='bla',
    aws_default_region='us-west-2',
    ...)
p = Pipeline(options=pipeline_options)
(p
 | 'read_from_s3' >> ReadFromText('s3://mybucket/some/path')
 ...
)
{code}

or

{code}
(p
 | 'read_from_s3' >> ReadFromText('s3://mybucket/some/path',
                                  aws_config=AWSConfig(...))
 ...
)
{code}

The former would be my preference: it seems more user-friendly, and it's easier to reuse if one needs multiple AWS sources/sinks in the same pipeline. Thoughts?

> Implement an S3 filesystem for Python SDK
> -----------------------------------------
>
>         Key: BEAM-2572
>         URL: https://issues.apache.org/jira/browse/BEAM-2572
>     Project: Beam
>  Issue Type: Task
>  Components: sdk-py
>    Reporter: Dmitry Demeshchuk
>    Assignee: Ahmet Altay
>    Priority: Minor
>
> There are two paths worth exploring, to my understanding:
> 1. Sticking to the HDFS-based approach (like it's done in Java).
> 2. Using boto/boto3 to access S3 through its common API endpoints.
> I personally prefer the second approach, for a few reasons:
> 1. In real life, HDFS and S3 have different consistency guarantees, so their behaviors may contradict each other in some edge cases (say, we write something to S3, but it's not immediately accessible for reading from another end).
> 2. There are other AWS-based sources and sinks we may want to create in the future: DynamoDB, Kinesis, SQS, etc.
> 3. boto3 already provides reasonably good logic for basic things like retrying.
> Whatever path we choose, there's another problem related to this: we currently cannot pass any global settings (say, pipeline options, or just an arbitrary kwarg) to a filesystem. Because of that, we'd have to set up the runner nodes with AWS keys in the environment, which is not trivial to achieve and doesn't look too clean either (I'd rather see one single place for configuring the runner options).
> Also, it's worth mentioning that I already have a janky S3 filesystem implementation that only supports DirectRunner at the moment (because of the previous paragraph). I'm perfectly fine finishing it myself, with some guidance from the maintainers.
> Where should I move on from here, and whose input should I be looking for?
> Thanks!

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
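To make the pipeline-options idea from the comment above concrete, here is a minimal sketch of how a filesystem could resolve AWS credentials from options first and fall back to the environment. All names here ({{AWSOptions}}, {{S3FileSystem}}, the option names) are hypothetical, not actual Beam or boto3 API; the real implementation would hand the resolved values to a boto3 client.

{code}
import os

class AWSOptions:
    """Hypothetical options object mirroring the kwargs-style
    construction shown above: PipelineOptions(aws_access_key_id=...)."""
    def __init__(self, **kwargs):
        self.aws_access_key_id = kwargs.get('aws_access_key_id')
        self.aws_secret_access_key = kwargs.get('aws_secret_access_key')
        self.aws_default_region = kwargs.get('aws_default_region', 'us-east-1')

class S3FileSystem:
    """Hypothetical S3 filesystem: credentials come from pipeline
    options when present, otherwise from the runner's environment."""
    def __init__(self, options=None):
        opts = options or AWSOptions()
        self.access_key = (opts.aws_access_key_id
                           or os.environ.get('AWS_ACCESS_KEY_ID'))
        self.secret_key = (opts.aws_secret_access_key
                           or os.environ.get('AWS_SECRET_ACCESS_KEY'))
        self.region = opts.aws_default_region

    def client_kwargs(self):
        # A real implementation would pass these straight to boto3, e.g.
        #   boto3.client('s3', **fs.client_kwargs())
        # Returned as a plain dict here so the sketch needs no boto3.
        return {'region_name': self.region,
                'aws_access_key_id': self.access_key,
                'aws_secret_access_key': self.secret_key}

options = AWSOptions(aws_access_key_id='bla',
                     aws_secret_access_key='bla',
                     aws_default_region='us-west-2')
fs = S3FileSystem(options)
print(fs.client_kwargs()['region_name'])  # us-west-2
{code}

The environment fallback keeps the current behavior working for runners that already export AWS keys, while giving pipelines one single place to configure credentials.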