[ https://issues.apache.org/jira/browse/BEAM-2572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16085074#comment-16085074 ]

Chamikara Jayalath commented on BEAM-2572:
------------------------------------------

Hi Dmitry,

I think it might be better to reduce the scope of this to providing 
credentials to FileSystem sub-classes for accessing remote file-systems. Any 
transform that needs credentials to access a third-party service outside of 
this could employ a similar technique, but I don't think it makes sense to 
enforce the same credential-access mechanism for all transforms.

We should try to avoid passing credentials (or any other state) required by 
FileSystem interfaces through IO transforms (such as 
ReadFromText/WriteToText). The FileSystem abstraction is a tool used by some 
of the transforms. Providing state needed by FileSystem objects through 
transform interfaces could cause an explosion in the number of parameters 
we'd have to support, as you mentioned in one of your comments (note also 
that the Python SDK uses keyword arguments, as opposed to the builder pattern 
used by the Java SDK).

I think the issue here is that FileSystem objects are instantiated by the 
SDK/runner in the background rather than directly by pipeline authors when 
defining a pipeline. Abstractions such as transforms, sources, sinks, and 
DoFns are instantiated directly by pipeline authors, so they do not have the 
problem of acquiring state through a secondary mechanism. The solution that 
makes the most sense to me is to get any required state (e.g. credentials) 
from the environment, as mentioned in some of the comments. How this state 
gets set in the environment is runner-specific: for DataflowRunner it could 
be through a separate package installed on the workers, while for 
DirectRunner the environment could be set up directly by users when defining 
the pipeline.
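To make the environment-based approach concrete, here is a minimal sketch of what a FileSystem implementation could do at access time. The helper name and error message are hypothetical (this is not existing Beam API); the point is only that credentials are resolved from the worker environment rather than threaded through transform arguments:

```python
import os


def load_s3_credentials():
    """Resolve S3 credentials from the worker environment.

    Hypothetical helper: an S3 FileSystem sub-class would call this when
    matching/opening files. The runner is responsible for making these
    variables available on workers (e.g. via a package installed on
    Dataflow workers, or directly by the user under DirectRunner).
    """
    access_key = os.environ.get("AWS_ACCESS_KEY_ID")
    secret_key = os.environ.get("AWS_SECRET_ACCESS_KEY")
    if not access_key or not secret_key:
        raise RuntimeError(
            "S3 credentials not found in the environment; the runner is "
            "responsible for provisioning them on workers.")
    return {"access_key": access_key, "secret_key": secret_key}
```

With this shape, ReadFromText/WriteToText need no new parameters at all; only the runner-specific environment setup changes per deployment.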

> Implement an S3 filesystem for Python SDK
> -----------------------------------------
>
>                 Key: BEAM-2572
>                 URL: https://issues.apache.org/jira/browse/BEAM-2572
>             Project: Beam
>          Issue Type: Task
>          Components: sdk-py
>            Reporter: Dmitry Demeshchuk
>            Assignee: Ahmet Altay
>            Priority: Minor
>
> There are two paths worth exploring, to my understanding:
> 1. Sticking to the HDFS-based approach (like it's done in Java).
> 2. Using boto/boto3 for accessing S3 through its common API endpoints.
> I personally prefer the second approach, for a few reasons:
> 1. In real life, HDFS and S3 have different consistency guarantees, therefore 
> their behaviors may contradict each other in some edge cases (say, we write 
> something to S3, but it's not immediately accessible for reading from another 
> end).
> 2. There are other AWS-based sources and sinks we may want to create in the 
> future: DynamoDB, Kinesis, SQS, etc.
> 3. boto3 already provides reasonably good logic for basic things like 
> retrying.
> Whatever path we choose, there's another problem related to this: we 
> currently cannot pass any global settings (say, pipeline options, or just an 
> arbitrary kwarg) to a filesystem. Because of that, we'd have to set up the 
> runner nodes to have AWS keys in the environment, which is not trivial to 
> achieve and doesn't look too clean either (I'd rather see one single place 
> for configuring the runner options).
> Also, it's worth mentioning that I already have a janky S3 filesystem 
> implementation that only supports DirectRunner at the moment (because of the 
> previous paragraph). I'm perfectly fine finishing it myself, with some 
> guidance from the maintainers.
> Where should I move on from here, and whose input should I be looking for?
> Thanks!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
