[ 
https://issues.apache.org/jira/browse/BEAM-2572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16086675#comment-16086675
 ] 

Dmitry Demeshchuk commented on BEAM-2572:
-----------------------------------------

[~altay] I actually had to struggle quite a lot before I could make this work 
properly. The juliaset example stopped working for me at some point (some 
setuptools-related issues, which I have yet to reproduce and report here in 
JIRA), so I spent a couple of days getting it to work for my use case 
(installing psycopg2 and its dependencies). It involved talking to people, 
reading the setuptools and distutils docs, plus a lot of debugging of Dataflow 
jobs. I can say for sure that if any other data engineers or data scientists 
went down the same path, they would be very likely to just give up.

The interface we ended up having at Postmates was basically this:
{code}
import dataflow

p = dataflow.Pipeline(
    'my-namespace',
    provision=[
        ['apt-get', 'install', '-y', 'libpq-dev'],
        ['pip', 'install', 'psycopg2'],
    ]
)
{code}

While I think some of the decisions here (hiding the pipeline options object, 
etc.) were questionable, it at least made it much easier for people to write a 
single Python script and get it running on Dataflow, without learning about 
the complications of dependency handling or the way setuptools works.
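
For illustration, a rough sketch of what that wrapper does under the hood: run 
each declared provisioning command on a worker before the pipeline starts. The 
function name here is hypothetical, not part of Beam or our internal library:
{code}
import subprocess

def run_provision_commands(commands):
    """Run each provisioning command on the worker, failing fast on error."""
    for command in commands:
        # check_call raises CalledProcessError on a non-zero exit code,
        # so a broken provisioning step surfaces immediately.
        subprocess.check_call(command)

run_provision_commands([
    ['echo', 'provisioning worker...'],
])
{code}
The important part is that the user only declares a list of commands; where and 
when they run is the wrapper's problem.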

I also understand that this approach may not be usable for non-Dataflow runners 
(although we don't have any others for Python yet, besides the direct one). But 
I do think that saying "if you use AWS sources and sinks, you'll have to write a 
setup.py file and do some magic" is a bit of overkill.
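
For reference, the "magic" in question looks roughly like this (modeled on the 
juliaset example's setup.py: a custom command chained into the build that runs 
shell commands on each worker at package-install time; the package names are 
just placeholders for the psycopg2 case):
{code}
import subprocess
from distutils.command.build import build as _build

import setuptools

# Shell commands to run on each Dataflow worker during installation.
CUSTOM_COMMANDS = [
    ['apt-get', 'update'],
    ['apt-get', 'install', '-y', 'libpq-dev'],
    ['pip', 'install', 'psycopg2'],
]

class CustomCommands(setuptools.Command):
    """Runs the shell commands above as an extra build step."""
    user_options = []

    def initialize_options(self):
        pass

    def finalize_options(self):
        pass

    def run(self):
        for command in CUSTOM_COMMANDS:
            subprocess.check_call(command)

# Chain the custom step into the regular build.
class build(_build):
    sub_commands = _build.sub_commands + [('CustomCommands', None)]

setuptools.setup(
    name='my-pipeline',
    version='0.0.1',
    packages=setuptools.find_packages(),
    cmdclass={'build': build, 'CustomCommands': CustomCommands},
)
{code}
That's a lot of boilerplate to ask of someone who just wants libpq on the 
workers, which is exactly the point above.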

> Implement an S3 filesystem for Python SDK
> -----------------------------------------
>
>                 Key: BEAM-2572
>                 URL: https://issues.apache.org/jira/browse/BEAM-2572
>             Project: Beam
>          Issue Type: Task
>          Components: sdk-py
>            Reporter: Dmitry Demeshchuk
>            Assignee: Ahmet Altay
>            Priority: Minor
>
> There are two paths worth exploring, to my understanding:
> 1. Sticking to the HDFS-based approach (like it's done in Java).
> 2. Using boto/boto3 to access S3 through its common API endpoints.
> I personally prefer the second approach, for a few reasons:
> 1. In real life, HDFS and S3 have different consistency guarantees, so their 
> behaviors may contradict each other in some edge cases (say, we write 
> something to S3, but it's not immediately accessible for reading from the 
> other end).
> 2. There are other AWS-based sources and sinks we may want to create in the 
> future: DynamoDB, Kinesis, SQS, etc.
> 3. boto3 already provides reasonably good retry logic for basic operations 
> out of the box.
> Whatever path we choose, there's another related problem: we currently cannot 
> pass any global settings (say, pipeline options, or just an arbitrary kwarg) 
> to a filesystem. Because of that, we'd have to set up the runner nodes with 
> AWS keys in the environment, which is neither trivial to achieve nor 
> particularly clean (I'd rather see one single place for configuring runner 
> options).
> Also, it's worth mentioning that I already have a janky S3 filesystem 
> implementation that only supports DirectRunner at the moment (because of the 
> previous paragraph). I'm perfectly fine finishing it myself, with some 
> guidance from the maintainers.
> Where should I move on from here, and whose input should I be looking for?
> Thanks!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)