[ 
https://issues.apache.org/jira/browse/BEAM-2572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16086434#comment-16086434
 ] 

Dmitry Demeshchuk commented on BEAM-2572:
-----------------------------------------

A couple of problems come to mind about environment-originated configuration:

1. How do we configure the runner's environment in the first place, at the user 
level? Another pipeline option? Or do we make users hack together their own 
solution? I agree that it's technically possible, just like provisioning a 
Dataflow container from inside Beam is, but it currently requires a lot of 
trial-and-error hacking. If we go down that path, I'd like to figure out this 
environment-configuration piece first, because without it the FileSystem 
implementation would be useless. (A rough sketch of the pipeline-option idea is 
below, after point 3.)

2. Some people on this thread (and on the mailing list) mentioned that we may 
want to have multiple sets of credentials. Reading and writing may use separate 
accounts/tokens, and different buckets may require different credentials. How 
would we configure that through the environment? Separating the reading/writing 
concerns seems doable, but I'm not so sure about per-bucket access, for 
instance. Maybe it's fine to say "we won't support that, at least for now".

3. It feels like the environment may be a bit too generally accessible/visible, 
which makes accidental leaking of credentials much easier. Maybe we should at 
least be storing them in files, e.g. {{~/.aws/credentials}} or 
{{~/.config/gcloud/}}? But then, that makes multi-credential access a bit 
trickier, unless we lean on named profiles (also sketched below).
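
To make point 1 concrete, here's a minimal sketch of what a pipeline-option-based 
configuration could look like. The {{S3Options}} class and the 
{{--s3_access_key_id}}/{{--s3_secret_access_key}} flags are hypothetical, not 
part of the current SDK:

{code:python}
# Hypothetical sketch: carry S3 credentials as pipeline options instead of
# relying on the runner environment. The option names are invented for
# illustration and do not exist in the SDK today.
from apache_beam.options.pipeline_options import PipelineOptions


class S3Options(PipelineOptions):

  @classmethod
  def _add_argparse_args(cls, parser):
    # Standard way to register custom options on a PipelineOptions subclass.
    parser.add_argument('--s3_access_key_id', default=None)
    parser.add_argument('--s3_secret_access_key', default=None)


options = PipelineOptions([
    '--s3_access_key_id=AKIA...',
    '--s3_secret_access_key=...',
])
s3_options = options.view_as(S3Options)
# The missing piece is a way for a FileSystem implementation to actually
# receive s3_options at runtime, which is exactly what this thread is about.
{code}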
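For points 2 and 3, file-based credentials don't necessarily rule out multiple 
credential sets: boto3 supports named profiles in {{~/.aws/credentials}}, so a 
(purely illustrative) per-bucket mapping to profile names could look roughly 
like this:

{code:python}
# Rough sketch: multiple credential sets via boto3 named profiles read from
# ~/.aws/credentials. The profile names and the bucket-to-profile mapping are
# invented for illustration; in practice the mapping would have to come from
# somewhere (pipeline options, a config file, etc.).
import boto3

BUCKET_PROFILES = {
    'analytics-input': 'readonly',
    'analytics-output': 'writer',
}


def client_for_bucket(bucket):
  """Returns an S3 client using the profile configured for the bucket."""
  profile = BUCKET_PROFILES.get(bucket)  # None falls back to the default profile.
  session = boto3.Session(profile_name=profile)
  return session.client('s3')


reader = client_for_bucket('analytics-input')
reader.get_object(Bucket='analytics-input', Key='some/key')
{code}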

> Implement an S3 filesystem for Python SDK
> -----------------------------------------
>
>                 Key: BEAM-2572
>                 URL: https://issues.apache.org/jira/browse/BEAM-2572
>             Project: Beam
>          Issue Type: Task
>          Components: sdk-py
>            Reporter: Dmitry Demeshchuk
>            Assignee: Ahmet Altay
>            Priority: Minor
>
> There are two paths worth exploring, to my understanding:
> 1. Sticking to the HDFS-based approach (like it's done in Java).
> 2. Using boto/boto3 for accessing S3 through its common API endpoints.
> I personally prefer the second approach, for a few reasons:
> 1. In real life, HDFS and S3 have different consistency guarantees, so their 
> behaviors may contradict each other in some edge cases (say, we write 
> something to S3, but it's not immediately available for reading from the 
> other end).
> 2. There are other AWS-based sources and sinks we may want to create in the 
> future: DynamoDB, Kinesis, SQS, etc.
> 3. boto3 already provides reasonably good logic for basic things like 
> retrying.
> Whatever path we choose, there's another problem related to this: we 
> currently cannot pass any global settings (say, pipeline options, or just an 
> arbitrary kwarg) to a filesystem. Because of that, we'd have to set up the 
> runner nodes with AWS keys in the environment, which is not trivial to 
> achieve and doesn't look too clean either (I'd rather see a single place for 
> configuring the runner options).
> Also, it's worth mentioning that I already have a janky S3 filesystem 
> implementation that only supports DirectRunner at the moment (because of the 
> previous paragraph). I'm perfectly fine finishing it myself, with some 
> guidance from the maintainers.
> Where should I move on from here, and whose input should I be looking for?
> Thanks!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)