Re: Enable security for data channels in portability

Hai Lu Tue, 30 Apr 2019 09:24:27 -0700

One thing to clarify is that we do not use Docker. I don't have much
experience with Docker; I assume Docker itself already provides network
isolation, and that's why it was never necessary to enable security in
the portable runner before?


For us, because we simply use processes, we need this extra secret (shared
through the file system) for authentication.
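
To illustrate the file-system secret idea, here is a minimal sketch in Python. It assumes the runner creates the secret file and the worker runs as the same user, so file permissions alone keep other tenants out. The function names (write_secret, read_secret, is_authorized) are hypothetical and not part of Beam; in practice the check would probably sit in a gRPC interceptor on the runner side, which is not shown here.

```python
import hmac
import os
import secrets


def write_secret(path):
    """Create a random secret in a file that only the owning user can access."""
    token = secrets.token_hex(32)
    # O_EXCL makes creation fail if the file already exists, so another
    # process cannot pre-create it; 0o600 limits access to the job's user.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "w") as f:
        f.write(token)
    return token


def read_secret(path):
    """Read the shared secret (what the worker side would do at startup)."""
    with open(path) as f:
        return f.read().strip()


def is_authorized(presented, expected):
    """Compare the presented secret with the expected one in constant time."""
    return hmac.compare_digest(presented.encode(), expected.encode())
```

The runner would call write_secret once per job, the worker would call read_secret and attach the value to each call, and the runner would reject any call for which is_authorized returns False.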

Let me create a ticket and send a PR, which should explain my intention
better.

Thanks,
Hai

On Mon, Apr 29, 2019 at 1:03 PM Lukasz Cwik <lc...@google.com> wrote:

> Changing the address to be loopback based upon how the environment is
> started (docker container/process/external/...) makes sense.
>
> How would the SDK and runner support storing/sharing this secret? (For
> example, in the docker container, how would the secret get there?)
>
> On Mon, Apr 29, 2019 at 9:23 AM Hai Lu <lhai...@gmail.com> wrote:
>
>> Hi Lukasz and Ankur,
>>
>> Thank you so much for your response! This is what we're
>> doing/implementing in our internal fork right now:
>>
>>    1. We assume that the Java process and Python process *are always
>>    colocated on the same host*, so first of all we use the "loopback"
>>    address instead of the "any address" that's currently being used on the
>>    Java side. That way, the traffic between the SDK worker and the runner
>>    is limited to the host and not exposed to the network.
>>    2. Because of the multi-tenant nature of our environment, we still
>>    want to have authentication even for localhost, so that random
>>    processes cannot connect to the data ports. Because different jobs have
>>    their own user names, it's sufficient to *use the file system to store
>>    an ad-hoc secret*, which can be shared by both the Python SDK and the
>>    Java runner. The runner uses this secret to authenticate the worker
>>    (by using a gRPC interceptor for this customized auth).
>>    3. With the 2 steps above, we *no longer need transport layer
>>    security* (SSL/TLS), so we abandon our initial plan to enable
>>    SSL/TLS.
>>
>> Above is the high level plan that I'm implementing. I would like to have
>> a similar solution in the open source to be merged with our internal fork.
>> Let me know what you think. If this sounds OK I will create a ticket for
>> myself and will first send out a short write-up in google doc to collect
>> comments soon.
>>
>> Thanks,
>> Hai
>>
>> On Fri, Apr 26, 2019 at 5:24 PM Ankur Goenka <goe...@google.com> wrote:
>>
>>> In an offline chat with Hai, it seems useful for users to be able to
>>> provide custom authentication, like a secret which can be distributed out of
>>> band by the infrastructure and provided via the file system, an RPC to
>>> another service, etc.
>>> gRPC already has some mechanisms for standard and custom
>>> authentication[1].
>>> Instrumenting the gRPC channel using a command line option or environment
>>> variable on the worker machines can be useful.
>>>
>>> [1] https://grpc.io/docs/guides/auth/
>>>
>>> On Fri, Apr 26, 2019 at 4:33 PM Lukasz Cwik <lc...@google.com> wrote:
>>>
>>>> The link to the ApiServiceDescriptor is
>>>> https://github.com/apache/beam/blob/476e17ed6badd4d5c06c4caf8a824805f40a8e7a/model/pipeline/src/main/proto/endpoints.proto#L31
>>>>
>>>> On Fri, Apr 26, 2019 at 4:32 PM Lukasz Cwik <lc...@google.com> wrote:
>>>>
>>>>> I had originally taken a look at this a while ago but not much has
>>>>> progressed since then. The original idea was that the ApiServiceDescriptor
>>>>> would be extended to support secure ways of authentication/communication.
>>>>> I was prototyping with an OAuth2 client credentials grant at the time but
>>>>> dropped it as other things were more important. The only currently
>>>>> supported mode across all SDKs is an implicit authenticated/secure mode
>>>>> where all communication is assumed to already be encrypted/private (e.g.
>>>>> over VPN that is managed externally with trusted services) and hence the
>>>>> gRPC channel itself is insecure and there is no authentication being
>>>>> performed.
>>>>>
>>>>> Even though sdk_worker.py seems like it supports credentials, no one
>>>>> invokes the constructor with credentials enabled as can be seen by this
>>>>> comment by Robert[1].
>>>>>
>>>>> For SSL/TLS support it seems like we need some way to configure a
>>>>> runner to be told to use SSL/TLS (potentially with a custom private key
>>>>> and trust chain). Do you have some suggestions on how we add support for
>>>>> passing around channel/call[2] credentials?
>>>>>
>>>>> 1:
>>>>> https://github.com/apache/beam/blob/476e17ed6badd4d5c06c4caf8a824805f40a8e7a/sdks/python/apache_beam/runners/worker/sdk_worker_main.py#L139
>>>>> 2: https://grpc.io/docs/guides/auth/
>>>>>
>>>>> On Tue, Apr 23, 2019 at 5:06 PM Hai Lu <lhai...@apache.org> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> This is Hai from LinkedIn. Daniel and I have been working on
>>>>>> productionizing the Samza portable runner. BTW, Daniel didn't mention in
>>>>>> his previous email that he has enabled and validated Python 3 for the
>>>>>> Samza runner and it worked smoothly. Kudos to the team!
>>>>>>
>>>>>> Here I have a few security-related questions about portability. At
>>>>>> LinkedIn, we enable SSL/TLS and ACLs for Kafka data and any data
>>>>>> exchange. In the case of the portable runner, we're required to secure
>>>>>> the data channels between Java and Python processes as well because our
>>>>>> Samza jobs are running in a multi-tenant environment. While I'm currently
>>>>>> working on this on our internal branch, I do want to keep it clean and
>>>>>> consistent with the master branch.
>>>>>>
>>>>>> My questions are: were there any plans/thoughts around security for
>>>>>> portability? I see that sdk_worker.py does have some code to create
>>>>>> secured gRPC channels; is anyone actually leveraging that code? I don't
>>>>>> see that any work has been done on the Java side, though.
>>>>>>
>>>>>> Thanks,
>>>>>> Hai Lu
>>>>>>
>>>>>
