Well, I don’t think a hook (or task) should obtain credentials by itself.
They should be supplied. The moment you start executing the task you cannot
trust it anymore (i.e. it is unmanaged / non-Airflow code).

So we could change BaseHook to understand supplied credentials and populate
a hash keyed by conn_id. Hooks normally call BaseHook.get_connection anyway,
so it shouldn’t be too hard, and in principle it should not require changes
to the hooks themselves if they are well behaved.
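A rough sketch of what that could look like (stand-in classes only; the
`supplied_connections` hash and this minimal `Connection` are illustrative
assumptions, not actual Airflow code):

```python
# Sketch: a BaseHook that prefers credentials supplied by the executor over
# a lookup in the Airflow metadata database. A well-behaved hook that already
# goes through get_connection would pick this up without changes.
from typing import Dict


class Connection:
    """Minimal stand-in for airflow.models.Connection."""

    def __init__(self, conn_id: str, login: str = "", password: str = ""):
        self.conn_id = conn_id
        self.login = login
        self.password = password


class BaseHook:
    # Populated by the worker from the payload the executor sent,
    # keyed by conn_id.
    supplied_connections: Dict[str, Connection] = {}

    @classmethod
    def get_connection(cls, conn_id: str) -> Connection:
        # Only hand out what was explicitly supplied; no fallback to the
        # connections table, so the task never needs database access.
        conn = cls.supplied_connections.get(conn_id)
        if conn is None:
            raise LookupError(f"No supplied credentials for {conn_id!r}")
        return conn


# The worker would populate the hash before the task starts:
BaseHook.supplied_connections["my_db"] = Connection("my_db", "user", "s3cret")
```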

B.

> On 28 Jul 2018, at 17:41, Dan Davydov <ddavy...@twitter.com.INVALID> wrote:
> 
> *So basically in the scheduler we parse the dag. Either from the manifest
> (new) or from smart parsing (probably harder, maybe some auto register?) we
> know what connections and keytabs are available dag wide or per task.*
> This is the hard part that I was curious about: for dynamically created
> DAGs, e.g. those generated by reading tasks from a MySQL database or a
> JSON file, there isn't a great way to do this.
> 
> I 100% agree with deprecating the connections table (at least for the
> secure option). The main work there is rewriting all hooks to take
> credentials from arbitrary data sources by allowing a customized
> CredentialsReader class. Although hooks are technically private, I think a
> lot of companies depend on them so the PMC should probably discuss if this
> is an Airflow 2.0 change or not.
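A CredentialsReader along those lines might look like this (the class, the
method names, and the JSON-in-env-var convention are illustrative
assumptions, not an existing Airflow interface):

```python
# Sketch of the CredentialsReader idea: hooks ask a pluggable reader for
# credentials instead of reading the connections table directly, so the
# data source can be swapped per deployment.
import json
import os
from abc import ABC, abstractmethod


class CredentialsReader(ABC):
    @abstractmethod
    def get_credentials(self, conn_id: str) -> dict:
        ...


class EnvCredentialsReader(CredentialsReader):
    """Reads JSON credentials from AIRFLOW_CONN_<ID> environment variables."""

    def get_credentials(self, conn_id: str) -> dict:
        raw = os.environ.get(f"AIRFLOW_CONN_{conn_id.upper()}")
        if raw is None:
            raise LookupError(f"no credentials for {conn_id!r}")
        return json.loads(raw)


os.environ["AIRFLOW_CONN_MY_DB"] = '{"login": "user", "password": "pw"}'
reader: CredentialsReader = EnvCredentialsReader()
creds = reader.get_credentials("my_db")  # {'login': 'user', 'password': 'pw'}
```

A secure deployment would plug in a reader backed by a real secrets store
instead of environment variables; the hook code would not change.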
> 
> On Fri, Jul 27, 2018 at 5:24 PM Bolke de Bruin <bdbr...@gmail.com> wrote:
> 
>> Sure. In general I consider keytabs part of the connection information.
>> Connections should be secured by sending the connection information a
>> task needs as part of the information the executor gets. A task should
>> then not need access to the connection table in Airflow. Keytabs could
>> then be sent as part of the connection information (base64 encoded) and
>> set up by the executor to be readable only by the task it is launching.
>> 
>> So basically in the scheduler we parse the dag. Either from the manifest
>> (new) or from smart parsing (probably harder, maybe some auto register?) we
>> know what connections and keytabs are available dag wide or per task.
>> 
>> The credentials and connection information are then serialized into a
>> protobuf message and sent to the executor as part of the “queue” action.
>> The worker then deserializes the information and makes it securely
>> available to the task (which is quite hard, btw).
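As a sketch of that flow, with JSON standing in for the protobuf message and
all field names assumed:

```python
# Scheduler -> executor -> worker credential hand-off, sketched. JSON stands
# in for the protobuf message described above; a real design would define a
# .proto schema instead.
import base64
import json
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class TaskCredentials:
    task_id: str
    connections: Dict[str, dict] = field(default_factory=dict)
    keytab_b64: str = ""  # keytab bytes, base64 encoded

    def serialize(self) -> bytes:
        return json.dumps(self.__dict__).encode()

    @staticmethod
    def deserialize(payload: bytes) -> "TaskCredentials":
        return TaskCredentials(**json.loads(payload))


# Scheduler side: attach only what this task needs, per the manifest.
msg = TaskCredentials(
    task_id="load_table",
    connections={"hive_default": {"login": "etl", "password": "pw"}},
    keytab_b64=base64.b64encode(b"\x05\x02fake-keytab").decode(),
)
payload = msg.serialize()  # shipped with the "queue" action

# Worker side: deserialize and make available to the task.
restored = TaskCredentials.deserialize(payload)
```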
>> 
>> On that last bit: making the info securely available might mean storing
>> it in the Linux KEYRING (supported by python keyring). Keytabs will be
>> tough to do properly because Java doesn't properly support the KEYRING,
>> only files, and files are hard to secure (a process could list all files
>> in /tmp and get credentials that way). Maybe storing the keytab
>> encrypted with a password and keeping the password in the KEYRING would
>> work. Something to find out.
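A stdlib-only sketch of the file side of that idea: the keytab written
readable only by the owning user, with a plain dict standing in for the
KEYRING (real code would use the `keyring` package's set/get_password):

```python
# Sketch: executor drops a keytab file with owner-only permissions and
# records where to find it in the keyring. The dict is a stand-in for the
# kernel KEYRING; encryption of the keytab itself is omitted here.
import os
import stat
import tempfile

keyring_standin = {}  # real code: keyring.set_password / keyring.get_password

keytab_bytes = b"\x05\x02fake-keytab"
fd, path = tempfile.mkstemp(suffix=".keytab")
try:
    os.fchmod(fd, 0o600)   # owner read/write only; nothing for group/other
    os.write(fd, keytab_bytes)
finally:
    os.close(fd)

# The launched task would look this up instead of scanning /tmp.
keyring_standin["airflow/keytab-path"] = path

mode = stat.S_IMODE(os.stat(path).st_mode)
os.unlink(path)  # cleanup for this sketch
```

This only mitigates the /tmp-listing problem; it does not protect against
other processes running as the same user, which is why the thread leans
toward encrypting the keytab and keeping the password in the KEYRING.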
>> 
>> B.
>> 
>> Sent from my iPad
>> 
>>> On 27 Jul 2018, at 22:04, Dan Davydov <ddavy...@twitter.com.INVALID>
>>> wrote:
>>> 
>>> I'm curious if you had any ideas on how to enable multi-tenancy with
>>> respect to Kerberos in Airflow.
>>> 
>>>> On Fri, Jul 27, 2018 at 2:38 PM Bolke de Bruin <bdbr...@gmail.com>
>> wrote:
>>>> 
>>>> Cool. The doc will need some refinement as it isn't entirely accurate.
>>>> In addition we need to distinguish between Airflow as a client of
>>>> kerberized services (this is what the astronomer doc talks about) vs
>>>> kerberizing Airflow itself, which the API supports.
>>>> 
>>>> In general, to access kerberized services (Airflow as a client) one
>>>> needs to start the ticket renewer with a valid keytab. Hooks don't
>>>> always need changes to support this: Hadoop CLI tools often just pick
>>>> the ticket up, because their client config tells them to. A second
>>>> class is HTTP-like services accessed by urllib under the hood; these
>>>> typically use SPNEGO and usually do need adjusting, as SPNEGO requires
>>>> some urllib configuration. Finally, there are protocols that use SASL
>>>> with Kerberos, like HDFS (not WebHDFS, which uses SPNEGO); these
>>>> require per-protocol implementations.
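Those three classes of client support could be summarized like this (the
mechanism names are real, but the mapping itself and the service labels are
illustrative, not an Airflow structure):

```python
# Which Kerberos mechanism each kind of service needs, per the three
# classes described above.
KERBEROS_MECHANISM = {
    # CLI tools read the ticket cache themselves once the renewer
    # (`airflow kerberos`) keeps it fresh -- no hook change needed.
    "spark-submit": "ticket cache (no code change)",
    "hdfs-cli": "ticket cache (no code change)",
    # HTTP-style services authenticate per request via SPNEGO, which
    # needs urllib/requests configuration in the hook.
    "webhdfs": "SPNEGO",
    # Wire protocols negotiate SASL/GSSAPI and need per-protocol support.
    "hdfs-native": "SASL with Kerberos",
    "hive-thrift": "SASL with Kerberos",
}
```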
>>>> 
>>>> Off the top of my head we support Kerberos client-side now with:
>>>> 
>>>> * Spark
>>>> * HDFS (snakebite on Python 2.7, the CLI, and the upcoming libhdfs
>>>> implementation)
>>>> * Hive (not the metastore, afaik)
>>>> 
>>>> Two things to remember:
>>>> 
>>>> * If a job (i.e. a Spark job) will finish later than the maximum ticket
>>>> lifetime, you probably need to provide a keytab to said application.
>>>> Otherwise you will get failures after the expiry.
>>>> * A keytab (used by the renewer) is credentials (user and password), so
>>>> jobs are executed under the keytab in use at that moment.
>>>> * Securing keytabs in multi-tenant Airflow is a challenge. This also
>>>> goes for securing connections. We need to fix this at some point; the
>>>> solution for now seems to be no multi-tenancy.
>>>> 
>>>> Kerberos seems harder than it is, btw. Still, we are sometimes moving
>>>> away from it to OAUTH2-based authentication. This gets us closer to
>>>> cloud standards (but we are on prem).
>>>> 
>>>> B.
>>>> 
>>>> Sent from my iPhone
>>>> 
>>>>> On 27 Jul 2018, at 17:41, Hitesh Shah <hit...@apache.org> wrote:
>>>>> 
>>>>> Hi Taylor
>>>>> 
>>>>> +1 on upstreaming this. It would be great if you can submit a pull
>>>> request
>>>>> to enhance the apache airflow docs.
>>>>> 
>>>>> thanks
>>>>> Hitesh
>>>>> 
>>>>> 
>>>>>> On Thu, Jul 26, 2018 at 2:32 PM Taylor Edmiston <tedmis...@gmail.com>
>>>> wrote:
>>>>>> 
>>>>>> While we're on the topic, I'd love any feedback from Bolke or others
>>>>>> who've used Kerberos with Airflow on this quick guide I put together
>>>>>> yesterday. It's similar to what's in the Airflow docs, but all on one
>>>>>> page and slightly expanded.
>>>>>> 
>>>>>> 
>>>>>> 
>>>> 
>> https://github.com/astronomerio/airflow-guides/blob/master/guides/kerberos.md
>>>>>> (or web version <https://www.astronomer.io/guides/kerberos/>)
>>>>>> 
>>>>>> One thing I'd like to add is a minimal example of how to Kerberize a
>>>> hook.
>>>>>> 
>>>>>> I'd be happy to upstream this as well if it's useful (maybe a
>> Concepts >
>>>>>> Additional Functionality > Kerberos page?)
>>>>>> 
>>>>>> Best,
>>>>>> Taylor
>>>>>> 
>>>>>> 
>>>>>> *Taylor Edmiston*
>>>>>> Blog <https://blog.tedmiston.com/> | CV
>>>>>> <https://stackoverflow.com/cv/taylor> | LinkedIn
>>>>>> <https://www.linkedin.com/in/tedmiston/> | AngelList
>>>>>> <https://angel.co/taylor> | Stack Overflow
>>>>>> <https://stackoverflow.com/users/149428/taylor-edmiston>
>>>>>> 
>>>>>> 
>>>>>> On Thu, Jul 26, 2018 at 5:18 PM, Driesprong, Fokko
>> <fo...@driesprong.frl
>>>>> 
>>>>>> wrote:
>>>>>> 
>>>>>>> Hi Ry,
>>>>>>> 
>>>>>>> You should ask Bolke de Bruin. He's really experienced with Kerberos,
>>>>>>> and he also did the Kerberos implementation for Airflow. Besides that,
>>>>>>> he also worked on implementing Kerberos in Ambari. Just wanted to let
>>>>>>> you know.
>>>>>>> 
>>>>>>> Cheers, Fokko
>>>>>>> 
>>>>>>> On Thu, 26 Jul 2018 at 23:03, Ry Walker <r...@astronomer.io> wrote:
>>>>>>> 
>>>>>>>> Hi everyone -
>>>>>>>> 
>>>>>>>> We have several bigCos who are considering using Airflow asking
>>>>>>>> about its support for Kerberos.
>>>>>>>> 
>>>>>>>> We're going to work on a proof-of-concept next week, will likely
>>>>>> record a
>>>>>>>> screencast on it.
>>>>>>>> 
>>>>>>>> For now, we're looking for any anecdotal information from
>>>>>>>> organizations who are using Kerberos with Airflow. If anyone would
>>>>>>>> be willing to share their experiences here, or reply to me
>>>>>>>> personally, it would be greatly appreciated!
>>>>>>>> 
>>>>>>>> -Ry
>>>>>>>> 
>>>>>>>> --
>>>>>>>> 
>>>>>>>> *Ry Walker* | CEO, Astronomer <http://www.astronomer.io/> |
>>>>>>> 513.417.2163 |
>>>>>>>> @rywalker <http://twitter.com/rywalker> | LinkedIn
>>>>>>>> <http://www.linkedin.com/in/rywalker>
>>>>>> 
>>>> 
>> 
