[jira] [Commented] (BEAM-2572) Implement an S3 filesystem for Python SDK

Steve Loughran (JIRA) Fri, 14 Jul 2017 02:17:34 -0700

    [ 
https://issues.apache.org/jira/browse/BEAM-2572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16087071#comment-16087071
 ]


Steve Loughran commented on BEAM-2572:
--------------------------------------

Worth mentioning a few recent changes in Hadoop S3A which you should 
anticipate needing:

# server-side encryption via AWS Key Management Service. Here the client 
declares that it wants to use SSE-KMS and then provides the name of the key 
to encrypt/decrypt with
# session keys, which need (userID, session-secret, session-ID). 
# support for different endpoints for different buckets. AWS v4 auth mandates 
that you declare the endpoint, rather than rely on the central one. As those 
endpoints stayed up during the great S3 outage, this is worth doing. [our 
list|https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/test/resources/core-site.xml]
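
For a boto3-based filesystem, the three items above could map onto client and 
request arguments roughly as follows. This is only a sketch: S3Settings and 
the two helper functions are hypothetical names, not part of Beam or Hadoop. 
The dictionary keys are the keyword arguments that boto3.client("s3", ...) and 
put_object(...) accept.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class S3Settings:
    """Hypothetical settings holder; none of these names come from Beam."""
    access_key: str
    secret_key: str
    session_token: Optional[str] = None  # set when using session credentials
    endpoint_url: Optional[str] = None   # endpoint to use for this bucket
    kms_key_id: Optional[str] = None     # enables SSE-KMS on writes


def client_kwargs(s: S3Settings) -> Dict[str, str]:
    """Keyword arguments intended for boto3.client("s3", **kwargs)."""
    kwargs = {
        "aws_access_key_id": s.access_key,
        "aws_secret_access_key": s.secret_key,
    }
    if s.session_token:
        kwargs["aws_session_token"] = s.session_token
    if s.endpoint_url:
        kwargs["endpoint_url"] = s.endpoint_url
    return kwargs


def put_object_kwargs(s: S3Settings) -> Dict[str, str]:
    """Extra keyword arguments intended for client.put_object(...)."""
    if not s.kms_key_id:
        return {}
    return {"ServerSideEncryption": "aws:kms", "SSEKMSKeyId": s.kms_key_id}
```

Keeping the settings in one object is a design guess; the point is only that 
all three features need somewhere to be configured.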

We've ended up supporting per-bucket configs, where you can configure the 
cluster with different options for different endpoints; as well as 
fs.s3a.secret.key, fs.s3a.endpoint, etc., we now let you define 
fs.s3a.bucket.${bucketname}.secret.key, etc.; these take priority.
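
The lookup order described here (per-bucket option first, then the global 
one) can be sketched in a few lines of Python. The function name and the 
plain-dict configuration are assumptions made for illustration; this is not 
how S3A implements it internally.

```python
from typing import Mapping, Optional


def resolve_option(conf: Mapping[str, str], bucket: str,
                   suffix: str) -> Optional[str]:
    """Return fs.s3a.bucket.<bucket>.<suffix> if set, else fs.s3a.<suffix>."""
    per_bucket = conf.get(f"fs.s3a.bucket.{bucket}.{suffix}")
    if per_bucket is not None:
        return per_bucket  # per-bucket values take priority
    return conf.get(f"fs.s3a.{suffix}")


# Example configuration with made-up values.
conf = {
    "fs.s3a.secret.key": "global-secret",
    "fs.s3a.endpoint": "s3.amazonaws.com",
    "fs.s3a.bucket.logs.endpoint": "s3.eu-west-1.amazonaws.com",
}

logs_endpoint = resolve_option(conf, "logs", "endpoint")
logs_secret = resolve_option(conf, "logs", "secret.key")
other_endpoint = resolve_option(conf, "other", "endpoint")
```

With the example values, the "logs" bucket gets its own endpoint but falls 
back to the global secret, and any other bucket uses the global settings 
throughout.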

We've also tried to reduce the number of times that secrets appear in logs 
with the embedded-in-URI mechanism of s3a://id:secret@bucket/data, by 
stripping them from the toString() value. This hasn't worked and I might 
revert it. Why? Too much code assumes that you can go Path -> String -> Path 
losslessly, as a simple form of serialization. Unless they all move to 
Path -> URI -> serialize -> URI -> Path, things don't work.
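
The same trap exists for a Python implementation. Below is a small sketch 
using urllib.parse, with a hypothetical strip_secrets() helper standing in 
for a log-safe string form: once the credentials are removed from the string, 
anything that rebuilds the location from that string has lost them.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_secrets(uri: str) -> str:
    """Return the URI without any user:password@ part (hypothetical helper)."""
    parts = urlsplit(uri)
    # Keep only what follows the last "@" in the authority component.
    netloc = parts.netloc.rpartition("@")[2]
    return urlunsplit(
        (parts.scheme, netloc, parts.path, parts.query, parts.fragment))


original = "s3a://id:secret@bucket/data"
logged = strip_secrets(original)

# Code that treats the string form as a serialization format now fails:
# the round-tripped value no longer carries the credentials.
round_tripped = urlsplit(logged)
```

If the string form is going to be sanitized, credentials have to travel some 
other way (for example as separate settings), which is one more argument for 
not embedding them in the URI at all.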

> Implement an S3 filesystem for Python SDK
> -----------------------------------------
>
>                 Key: BEAM-2572
>                 URL: https://issues.apache.org/jira/browse/BEAM-2572
>             Project: Beam
>          Issue Type: Task
>          Components: sdk-py
>            Reporter: Dmitry Demeshchuk
>            Assignee: Ahmet Altay
>            Priority: Minor
>
> There are two paths worth exploring, to my understanding:
> 1. Sticking to the HDFS-based approach (like it's done in Java).
> 2. Using boto/boto3 for accessing S3 through its common API endpoints.
> I personally prefer the second approach, for a few reasons:
> 1. In real life, HDFS and S3 have different consistency guarantees, therefore 
> their behaviors may contradict each other in some edge cases (say, we write 
> something to S3, but it's not immediately accessible for reading from another 
> end).
> 2. There are other AWS-based sources and sinks we may want to create in the 
> future: DynamoDB, Kinesis, SQS, etc.
> 3. boto3 already provides somewhat good logic for basic things like 
> reattempting.
> Whatever path we choose, there's another problem related to this: we 
> currently cannot pass any global settings (say, pipeline options, or just an 
> arbitrary kwarg) to a filesystem. Because of that, we'd have to set up the 
> runner nodes to have AWS keys set up in the environment, which is not trivial 
> to achieve and doesn't look too clean either (I'd rather see one single place 
> for configuring the runner options).
> Also, it's worth mentioning that I already have a janky S3 filesystem 
> implementation that only supports DirectRunner at the moment (because of the 
> previous paragraph). I'm perfectly fine finishing it myself, with some 
> guidance from the maintainers.
> Where should I move on from here, and whose input should I be looking for?
> Thanks!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
