Dmitry Demeshchuk created BEAM-2572:
---------------------------------------

             Summary: Implement an S3 filesystem for Python SDK
                 Key: BEAM-2572
                 URL: https://issues.apache.org/jira/browse/BEAM-2572
             Project: Beam
          Issue Type: Task
          Components: sdk-py
            Reporter: Dmitry Demeshchuk
            Assignee: Ahmet Altay
            Priority: Minor


There are two paths worth exploring, to my understanding:

1. Sticking to the HDFS-based approach (like it's done in Java).
2. Using boto/boto3 for accessing S3 through its common API endpoints.

I personally prefer the second approach, for a few reasons:

1. In practice, HDFS and S3 have different consistency guarantees, so their 
behavior may diverge in some edge cases (say, we write something to S3, but 
it's not immediately readable from another client).

2. There are other AWS-based sources and sinks we may want to create in the 
future: DynamoDB, Kinesis, SQS, etc.

3. boto3 already provides reasonably good built-in logic for basics such as 
retries.
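
A minimal sketch of what the boto3-based approach could look like, assuming 
paths of the form s3://bucket/key. The helper names (parse_s3_path, 
read_s3_object) are hypothetical and not part of Beam. The retry settings are 
passed through botocore's Config object, and the "standard" retry mode assumes 
a reasonably recent botocore.

```python
S3_PREFIX = "s3://"


def parse_s3_path(path):
    """Split 's3://bucket/some/key' into ('bucket', 'some/key').

    Hypothetical helper for illustration; not an existing Beam function.
    """
    if not path.startswith(S3_PREFIX):
        raise ValueError("not an S3 path: %r" % (path,))
    bucket, _, key = path[len(S3_PREFIX):].partition("/")
    if not bucket:
        raise ValueError("missing bucket in S3 path: %r" % (path,))
    return bucket, key


def read_s3_object(path, max_attempts=5):
    """Read a whole object into memory, relying on boto3's own retries."""
    # Imported lazily so that parse_s3_path works without boto3 installed.
    import boto3
    from botocore.config import Config

    bucket, key = parse_s3_path(path)
    config = Config(retries={"max_attempts": max_attempts, "mode": "standard"})
    client = boto3.client("s3", config=config)
    return client.get_object(Bucket=bucket, Key=key)["Body"].read()
```

A real filesystem would stream large objects rather than read them fully, but 
the path handling and client setup would probably look similar.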

Whichever path we choose, there's a related problem: we currently cannot pass 
any global settings (say, pipeline options, or just an arbitrary kwarg) to a 
filesystem. Because of that, we'd have to set up the runner nodes with AWS keys 
in their environment, which is not trivial to achieve and doesn't look too 
clean either (I'd rather have one single place for configuring the runner 
options).
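
To illustrate the configuration problem, here is a rough sketch of a filesystem 
that accepts explicit settings and only falls back to the environment when they 
are missing. The class S3FileSystemSketch and its pipeline_options argument are 
hypothetical: Beam filesystems cannot currently receive such options, which is 
exactly the gap described above. The environment variable names and the 
resulting keyword arguments are the ones boto3 itself understands.

```python
import os


class S3FileSystemSketch(object):
    """Hypothetical filesystem wrapper; not an existing Beam class."""

    # Maps boto3.client() keyword arguments to standard AWS env variables.
    _CREDENTIAL_SOURCES = (
        ("aws_access_key_id", "AWS_ACCESS_KEY_ID"),
        ("aws_secret_access_key", "AWS_SECRET_ACCESS_KEY"),
    )

    def __init__(self, pipeline_options=None):
        # Assumption: options arrive as a plain dict of settings.
        self._options = dict(pipeline_options or {})

    def client_kwargs(self, environ=None):
        """Build kwargs for boto3.client('s3', **kwargs).

        Explicit options win; the environment is only a fallback.
        """
        environ = os.environ if environ is None else environ
        kwargs = {}
        for option_name, env_name in self._CREDENTIAL_SOURCES:
            value = self._options.get(option_name) or environ.get(env_name)
            if value:
                kwargs[option_name] = value
        return kwargs
```

With something like this, DirectRunner and remote runners could share one 
source of configuration instead of relying on each worker's environment.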

Also, it's worth mentioning that I already have a janky S3 filesystem 
implementation that only supports DirectRunner at the moment (because of the 
previous paragraph). I'm perfectly fine finishing it myself, with some guidance 
from the maintainers.

Where should I go from here, and whose input should I be looking for?

Thanks!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
