Re: Enable security for data channels in portability

Lukasz Cwik Mon, 29 Apr 2019 13:04:40 -0700

Changing the address to be loopback based upon how the environment is
started (docker container/process/external/...) makes sense.
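
As a rough illustration of that idea, here is a minimal sketch using only the Python standard library. The environment labels, the helper names, and the mapping itself are assumptions made for illustration, not existing Beam code; in particular, whether a docker or external environment can use loopback depends on how its networking is set up.

```python
import socket

LOOPBACK = "127.0.0.1"
ANY_ADDRESS = "0.0.0.0"


def bind_host_for(environment_type: str) -> str:
    """Pick the interface the runner-side server listens on (hypothetical helper)."""
    if environment_type == "process":
        # The worker is a process on the same host, so loopback is sufficient.
        return LOOPBACK
    # Assumption: docker/external workers may connect over a non-loopback
    # interface, so fall back to listening on all interfaces.
    return ANY_ADDRESS


def open_listener(environment_type: str) -> socket.socket:
    """Open a TCP listener on the chosen interface; port 0 lets the OS pick a free port."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind((bind_host_for(environment_type), 0))
    listener.listen()
    return listener
```

A gRPC server would make the same choice in the host part of the address string it binds to.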


How would the SDK and runner support storing/sharing this secret? (For
example, in the docker container, how would the secret get there?)
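
For the colocated, same-user case described in item 2 of the quoted message, the file-based secret might look roughly like the sketch below. The function names and file layout are hypothetical, and the sketch does not cover the docker case, where the file would have to be mounted or passed in some other way.

```python
import hmac
import os
import secrets


def write_secret(path: str) -> str:
    """Create a fresh random secret readable only by the owning user."""
    token = secrets.token_hex(32)
    # O_EXCL makes creation fail if the file already exists;
    # mode 0o600 grants read/write to the owner only (on POSIX systems).
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "w") as f:
        f.write(token)
    return token


def read_secret(path: str) -> str:
    """Load the secret on the other side (for example, in the SDK worker)."""
    with open(path) as f:
        return f.read().strip()


def is_authorized(presented: str, expected: str) -> bool:
    """Compare in constant time to avoid leaking the secret through timing."""
    return hmac.compare_digest(presented.encode(), expected.encode())
```

One side would call `write_secret` before starting the other, the worker would attach the value from `read_secret` to each call (for example as gRPC metadata), and a server-side interceptor would apply `is_authorized` before letting the call through.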

On Mon, Apr 29, 2019 at 9:23 AM Hai Lu <[email protected]> wrote:

> Hi Lukasz and Ankur,
>
> Thank you so much for your response! This is what we're doing/implementing
> in our internal fork right now:
>
>    1. We assume that the Java process and Python process *are always
>    colocated on the same host*, so first of all we use the loopback address
>    instead of the "any" address that the Java side currently uses. That
>    way, traffic between the SDK worker and the runner stays on the host and
>    is not exposed to the network.
>    2. Because of the multi-tenant nature of our environment, we still
>    want authentication even on localhost, so that arbitrary processes cannot
>    connect to the data ports. Because each job runs under its own user
>    name, it's sufficient to *use the file system to store an ad-hoc secret*,
>    which can be shared by both the Python SDK and the Java runner. The runner
>    uses this secret to authenticate the worker (via a gRPC interceptor that
>    implements this custom auth).
>    3. With the two steps above, we *no longer need transport layer
>    security* (SSL/TLS), so we have abandoned our initial plan to enable SSL/TLS.
>
> Above is the high-level plan that I'm implementing. I would like to have a
> similar solution in open source that can be merged with our internal fork.
> Let me know what you think. If this sounds OK, I will create a ticket for
> myself and will first send out a short write-up in a Google Doc to collect
> comments soon.
>
> Thanks,
> Hai
>
> On Fri, Apr 26, 2019 at 5:24 PM Ankur Goenka <[email protected]> wrote:
>
>> In an offline chat with Hai, it seemed useful for users to be able to
>> provide custom authentication, such as a secret that is distributed out of
>> band by the infrastructure and provided via the file system, an RPC to
>> another service, etc.
>> gRPC already has some mechanisms for standard and custom authentication[1].
>> Instrumenting the gRPC channel using a command line option or environment
>> variable on the worker machines could be useful.
>>
>> [1] https://grpc.io/docs/guides/auth/
>>
>> On Fri, Apr 26, 2019 at 4:33 PM Lukasz Cwik <[email protected]> wrote:
>>
>>> The link to the ApiServiceDescriptor is
>>> https://github.com/apache/beam/blob/476e17ed6badd4d5c06c4caf8a824805f40a8e7a/model/pipeline/src/main/proto/endpoints.proto#L31
>>>
>>> On Fri, Apr 26, 2019 at 4:32 PM Lukasz Cwik <[email protected]> wrote:
>>>
>>>> I had originally taken a look at this a while ago but not much has
>>>> progressed since then. The original idea was that the ApiServiceDescriptor
>>>> would be extended to support secure ways of authentication/communication. I
>>>> was prototyping with an OAuth2 client credentials grant at the time but
>>>> dropped it as other things were more important. The only currently
>>>> supported mode across all SDKs is an implicit authenticated/secure mode
>>>> where all communication is assumed to already be encrypted/private (e.g.
>>>> over VPN that is managed externally with trusted services) and hence the
>>>> gRPC channel itself is insecure and there is no authentication being
>>>> performed.
>>>>
>>>> Even though sdk_worker.py seems like it supports credentials, no one
>>>> invokes the constructor with credentials enabled as can be seen by this
>>>> comment by Robert[1].
>>>>
>>>> For SSL/TLS support it seems like we need some way to configure a
>>>> runner to be told to use SSL/TLS (potentially with a custom private key and
>>>> trust chain). Do you have some suggestions on how we add support for
>>>> passing around channel/call[2] credentials?
>>>>
>>>> 1:
>>>> https://github.com/apache/beam/blob/476e17ed6badd4d5c06c4caf8a824805f40a8e7a/sdks/python/apache_beam/runners/worker/sdk_worker_main.py#L139
>>>> 2: https://grpc.io/docs/guides/auth/
>>>>
>>>> On Tue, Apr 23, 2019 at 5:06 PM Hai Lu <[email protected]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> This is Hai from LinkedIn. Daniel and I have been working on
>>>>> productionizing the Samza portable runner. BTW, Daniel didn't mention in his
>>>>> previous email that he has enabled and validated Python 3 for the Samza
>>>>> runner and it worked smoothly. Kudos to the team!
>>>>>
>>>>> Here I have a few security related questions about portability. At
>>>>> LinkedIn, we enable SSL/TLS and ACLs for Kafka data and any data exchange.
>>>>> In the case of the portable runner, we're required to secure the data channels
>>>>> between Java and Python processes as well because our Samza jobs are
>>>>> running in a multi-tenant environment. While I'm currently working on this
>>>>> on our internal branch, I do want to keep it clean and consistent with the
>>>>> master branch.
>>>>>
>>>>> My questions are: were there any plans/thoughts around security for
>>>>> portability? I see that sdk_worker.py does have some code to create
>>>>> secured gRPC channels; is anyone actually leveraging that code? I don't
>>>>> see any corresponding work on the Java side, though.
>>>>>
>>>>> Thanks,
>>>>> Hai Lu
>>>>>
>>>>
