[ 
https://issues.apache.org/jira/browse/BEAM-2572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16086535#comment-16086535
 ] 

Dmitry Demeshchuk commented on BEAM-2572:
-----------------------------------------

re 1: I just don't want us to end up in a situation like this:

List: We just released an S3 filesystem! Please use it and tell us what you 
think!
User7231: Hi, how do I provide credentials for the filesystem, in case I run my 
stuff on Dataflow?
List: Just set up the environment variables AWS_ACCESS_KEY_ID and 
AWS_SECRET_ACCESS_KEY on your Dataflow nodes!
User7231: Cool, how can I do that?
List: Well, there's no official way, so you'll just have to hack together a 
custom package, or something like that!
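For context, the environment-variable route in the hypothetical exchange above boils down to something like this (a minimal sketch, not Beam code; the function name is made up, but AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are the same variables boto3 falls back to when no explicit credentials are supplied):

```python
import os

def resolve_aws_credentials(environ=None):
    """Return (access_key, secret_key) from the environment, or None if unset.

    Sketch only: this mirrors the env-var fallback boto3 performs itself;
    the point of the discussion is that there is no official way to set
    these variables on Dataflow workers in the first place.
    """
    environ = os.environ if environ is None else environ
    access_key = environ.get("AWS_ACCESS_KEY_ID")
    secret_key = environ.get("AWS_SECRET_ACCESS_KEY")
    if access_key and secret_key:
        return (access_key, secret_key)
    return None
```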

We only have two runners for Python right now: Direct and Dataflow. I think it 
would make sense to make things runnable on Dataflow too, even if configuring 
the environment ends up being a Dataflow-specific mechanism, totally 
independent from Beam. What worries me about making it a Dataflow feature is 
that the whole Beam S3 feature would become dependent on the Dataflow planning 
and release cycle before it's usable to people.

re 2, 3: That's a good point. FWIW, I'm all for reducing the scope and 
complexity of this feature. I'd rather have a non-ideal solution in a month 
than an ideal solution someday.


I apologize for dragging this conversation out so far; there just seems to be 
no clear consensus on the subject, and I really want this to be usable beyond 
just the direct runner.

> Implement an S3 filesystem for Python SDK
> -----------------------------------------
>
>                 Key: BEAM-2572
>                 URL: https://issues.apache.org/jira/browse/BEAM-2572
>             Project: Beam
>          Issue Type: Task
>          Components: sdk-py
>            Reporter: Dmitry Demeshchuk
>            Assignee: Ahmet Altay
>            Priority: Minor
>
> There are two paths worth exploring, to my understanding:
> 1. Sticking to the HDFS-based approach (like it's done in Java).
> 2. Using boto/boto3 for accessing S3 through its common API endpoints.
> I personally prefer the second approach, for a few reasons:
> 1. In real life, HDFS and S3 have different consistency guarantees, therefore 
> their behaviors may contradict each other in some edge cases (say, we write 
> something to S3, but it's not immediately accessible for reading from another 
> end).
> 2. There are other AWS-based sources and sinks we may want to create in the 
> future: DynamoDB, Kinesis, SQS, etc.
> 3. boto3 already provides reasonably good logic for basic things like 
> retrying.
> Whatever path we choose, there's another problem related to this: we 
> currently cannot pass any global settings (say, pipeline options, or just an 
> arbitrary kwarg) to a filesystem. Because of that, we'd have to set up the 
> runner nodes with AWS keys in the environment, which is neither trivial to 
> achieve nor particularly clean (I'd rather see one single place for 
> configuring the runner options).
> Also, it's worth mentioning that I already have a janky S3 filesystem 
> implementation that only supports DirectRunner at the moment (because of the 
> previous paragraph). I'm perfectly fine finishing it myself, with some 
> guidance from the maintainers.
> Where should I move on from here, and whose input should I be looking for?
> Thanks!
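The configuration gap described in the issue above could, under a hypothetical API, look like this (every name here is illustrative only, not an actual Beam SDK class; it just shows the "one single place for configuring" idea of passing credentials to a filesystem explicitly instead of relying on worker environment state):

```python
# Hypothetical sketch of the proposal in the issue description: a filesystem
# constructor that takes credentials as explicit options, rather than
# requiring AWS keys in each runner node's environment.
class S3FileSystemSketch:
    def __init__(self, access_key_id=None, secret_access_key=None, region=None):
        # Illustrative attributes; none of this exists in the Beam SDK.
        self.access_key_id = access_key_id
        self.secret_access_key = secret_access_key
        self.region = region

    def configured(self):
        # True when credentials were passed explicitly rather than implied
        # by the worker environment.
        return bool(self.access_key_id and self.secret_access_key)
```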



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
