[
https://issues.apache.org/jira/browse/BEAM-2572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16078746#comment-16078746
]
Dmitry Demeshchuk commented on BEAM-2572:
-----------------------------------------
The biggest blocker for me right now is making the source and the sink aware
of the AWS credentials.
I see two user-friendly ways to do this: either add some pipeline option(s)
and make them accessible to the source and sink, or pass extra arguments
directly to the source/sink.
{code}
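# Note: the aws_* options below are proposed, not existing, pipeline options.
from apache_beam import Pipeline
from apache_beam.io import ReadFromText
from apache_beam.options.pipeline_options import PipelineOptions

pipeline_options = PipelineOptions(
    aws_access_key_id='bla', aws_secret_access_key='bla',
    aws_default_region='us-west-2', ...)
p = Pipeline(options=pipeline_options)
(p
 | 'read_from_s3' >> ReadFromText('s3://mybucket/some/path')
 ...
)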
{code}
or
{code}
(p
 | 'read_from_s3' >> ReadFromText('s3://mybucket/some/path',
                                  aws_config=AWSConfig(...))
 ...
)
{code}
The former would be my preference: it seems more user-friendly and is easier
to reuse when one needs multiple AWS sources/sinks in the same pipeline.
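
The two approaches need not be mutually exclusive. Below is a minimal sketch of
how a source/sink might resolve its credentials, assuming hypothetical names
(`AWSConfig`, `resolve_aws_config`) and a plain dict standing in for pipeline
options; none of this is existing Beam API.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class AWSConfig:
    """Hypothetical credentials holder; not an existing Beam class."""
    access_key_id: str
    secret_access_key: str
    region: str


def resolve_aws_config(pipeline_options: Mapping[str, str],
                       aws_config: Optional[AWSConfig] = None) -> AWSConfig:
    """Prefer an explicit per-transform config, else use pipeline-wide options."""
    if aws_config is not None:
        return aws_config
    return AWSConfig(
        access_key_id=pipeline_options['aws_access_key_id'],
        secret_access_key=pipeline_options['aws_secret_access_key'],
        region=pipeline_options['aws_default_region'],
    )


# Pipeline-wide settings (a dict stands in for real pipeline options here),
# shared by every AWS source/sink in the pipeline.
options = {
    'aws_access_key_id': 'bla',
    'aws_secret_access_key': 'bla',
    'aws_default_region': 'us-west-2',
}

# A source with no explicit config falls back to the shared options.
shared = resolve_aws_config(options)

# A source that passes its own config overrides the shared options.
override = resolve_aws_config(options, AWSConfig('other', 'other', 'eu-west-1'))
```

With this shape, most transforms would rely on the shared options, and the
extra argument would only be needed when one source or sink uses a different
account or region.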
Thoughts?
> Implement an S3 filesystem for Python SDK
> -----------------------------------------
>
> Key: BEAM-2572
> URL: https://issues.apache.org/jira/browse/BEAM-2572
> Project: Beam
> Issue Type: Task
> Components: sdk-py
> Reporter: Dmitry Demeshchuk
> Assignee: Ahmet Altay
> Priority: Minor
>
> There are two paths worth exploring, to my understanding:
> 1. Sticking to the HDFS-based approach (like it's done in Java).
> 2. Using boto/boto3 for accessing S3 through its common API endpoints.
> I personally prefer the second approach, for a few reasons:
> 1. In real life, HDFS and S3 have different consistency guarantees, so their
> behaviors may differ in some edge cases (say, we write something to S3, but
> it's not immediately accessible for reading from another end).
> 2. There are other AWS-based sources and sinks we may want to create in the
> future: DynamoDB, Kinesis, SQS, etc.
> 3. boto3 already provides reasonably good logic for basic things like
> retrying failed requests.
> Whatever path we choose, there's another problem related to this: we
> currently cannot pass any global settings (say, pipeline options, or just an
> arbitrary kwarg) to a filesystem. Because of that, we'd have to set up the
> runner nodes with AWS keys in the environment, which is not trivial
> to achieve and doesn't look too clean either (I'd rather see one single place
> for configuring the runner options).
> Also, it's worth mentioning that I already have a janky S3 filesystem
> implementation that only supports DirectRunner at the moment (because of the
> previous paragraph). I'm perfectly fine finishing it myself, with some
> guidance from the maintainers.
> Where should I go from here, and whose input should I be looking for?
> Thanks!