Well, I don’t think a hook (or task) should obtain it by itself. It should be supplied. From the moment you start executing the task you cannot trust it anymore (i.e. it is unmanaged / non-Airflow code).
So we could change the BaseHook to understand supplied credentials and populate a hash keyed by “conn_id”. Hooks normally call BaseHook.get_connection anyway, so it shouldn't be too hard and should in principle not require changes to the hooks themselves if they are well behaved.

B.

> On 28 Jul 2018, at 17:41, Dan Davydov <ddavy...@twitter.com.INVALID> wrote:
>
> *So basically in the scheduler we parse the dag. Either from the manifest
> (new) or from smart parsing (probably harder, maybe some auto register?) we
> know what connections and keytabs are available dag wide or per task.*
> This is the hard part that I was curious about: for dynamically created
> DAGs, e.g. those generated by reading tasks in a MySQL database or a json
> file, there isn't a great way to do this.
>
> I 100% agree with deprecating the connections table (at least for the
> secure option). The main work there is rewriting all hooks to take
> credentials from arbitrary data sources by allowing a customized
> CredentialsReader class. Although hooks are technically private, I think a
> lot of companies depend on them so the PMC should probably discuss if this
> is an Airflow 2.0 change or not.
>
> On Fri, Jul 27, 2018 at 5:24 PM Bolke de Bruin <bdbr...@gmail.com> wrote:
>
>> Sure. In general I consider keytabs as a part of connection information.
>> Connections should be secured by sending the connection information a task
>> needs as part of the information the executor gets. A task should then not need
>> access to the connection table in Airflow. Keytabs could then be sent as
>> part of the connection information (base64 encoded) and set up by the
>> executor (this key) to be read only to the task it is launching.
>>
>> So basically in the scheduler we parse the dag. Either from the manifest
>> (new) or from smart parsing (probably harder, maybe some auto register?) we
>> know what connections and keytabs are available dag wide or per task.
>>
>> The credentials and connection information are then serialized into a
>> protobuf message and sent to the executor as part of the “queue” action.
>> The worker then deserializes the information and makes it securely
>> available to the task (which is quite hard btw).
>>
>> On that last bit, making the info securely available might mean storing it in
>> the Linux KEYRING (supported by python keyring). Keytabs will be tough to
>> do properly because Java does not properly support KEYRING, only files, and
>> these are hard to make secure (due to the possibility that a process will list
>> all files in /tmp and get credentials through that). Maybe storing the
>> keytab with a password and having the password in the KEYRING might work.
>> Something to find out.
>>
>> B.
>>
>> Sent from my iPad
>>
>>> On 27 Jul 2018, at 22:04, Dan Davydov <ddavy...@twitter.com.INVALID> wrote:
>>>
>>> I'm curious if you had any ideas to enable multi-tenancy
>>> with respect to Kerberos in Airflow.
>>>
>>>> On Fri, Jul 27, 2018 at 2:38 PM Bolke de Bruin <bdbr...@gmail.com> wrote:
>>>>
>>>> Cool. The doc will need some refinement as it isn't entirely accurate. In
>>>> addition we need to separate between Airflow as a client of kerberized
>>>> services (this is what is talked about in the astronomer doc) vs
>>>> kerberizing airflow itself, which the API supports.
>>>>
>>>> In general to access kerberized services (airflow as a client) one needs
>>>> to start the ticket renewer with a valid keytab. For the hooks it isn't
>>>> always required to change the hook to support it. Hadoop cli tools often
>>>> just pick it up as their client config is set to do so. Then another class
>>>> is there for HTTP-like services which are accessed by urllib under the
>>>> hood; these typically use SPNEGO. These often need to be adjusted as it
>>>> requires some urllib config. Finally, there are protocols which use SASL
>>>> with kerberos.
>>>> Like HDFS (not webhdfs, that uses SPNEGO). These require per-protocol
>>>> implementations.
>>>>
>>>> Off the top of my head we support kerberos client side now with:
>>>>
>>>> * Spark
>>>> * HDFS (snakebite python 2.7, cli and with the upcoming libhdfs
>>>>   implementation)
>>>> * Hive (not metastore afaik)
>>>>
>>>> A few things to remember:
>>>>
>>>> * If a job (e.g. a Spark job) will finish later than the maximum ticket
>>>>   lifetime you probably need to provide a keytab to said application.
>>>>   Otherwise you will get failures after the expiry.
>>>> * A keytab (used by the renewer) is a set of credentials (user and pass), so
>>>>   jobs are executed under the keytab in use at that moment.
>>>> * Securing keytabs in multi-tenant Airflow is a challenge. This also goes
>>>>   for securing connections. This we need to fix at some point. The solution
>>>>   for now seems to be no multi-tenancy.
>>>>
>>>> Kerberos seems harder than it is btw. Still, we are sometimes moving away
>>>> from it to OAuth2 based authentication. This gets us closer to cloud
>>>> standards (but we are on prem).
>>>>
>>>> B.
>>>>
>>>> Sent from my iPhone
>>>>
>>>>> On 27 Jul 2018, at 17:41, Hitesh Shah <hit...@apache.org> wrote:
>>>>>
>>>>> Hi Taylor
>>>>>
>>>>> +1 on upstreaming this. It would be great if you can submit a pull request
>>>>> to enhance the apache airflow docs.
>>>>>
>>>>> thanks
>>>>> Hitesh
>>>>>
>>>>>
>>>>>> On Thu, Jul 26, 2018 at 2:32 PM Taylor Edmiston <tedmis...@gmail.com> wrote:
>>>>>>
>>>>>> While we're on the topic, I'd love any feedback from Bolke or others who've
>>>>>> used Kerberos with Airflow on this quick guide I put together yesterday.
>>>>>> It's similar to what's in the Airflow docs but instead all on one page
>>>>>> and slightly expanded.
>>>>>>
>>>>>> https://github.com/astronomerio/airflow-guides/blob/master/guides/kerberos.md
>>>>>> (or web version <https://www.astronomer.io/guides/kerberos/>)
>>>>>>
>>>>>> One thing I'd like to add is a minimal example of how to Kerberize a hook.
>>>>>>
>>>>>> I'd be happy to upstream this as well if it's useful (maybe a
>>>>>> Concepts > Additional Functionality > Kerberos page?)
>>>>>>
>>>>>> Best,
>>>>>> Taylor
>>>>>>
>>>>>>
>>>>>> *Taylor Edmiston*
>>>>>> Blog <https://blog.tedmiston.com/> | CV
>>>>>> <https://stackoverflow.com/cv/taylor> | LinkedIn
>>>>>> <https://www.linkedin.com/in/tedmiston/> | AngelList
>>>>>> <https://angel.co/taylor> | Stack Overflow
>>>>>> <https://stackoverflow.com/users/149428/taylor-edmiston>
>>>>>>
>>>>>>
>>>>>> On Thu, Jul 26, 2018 at 5:18 PM, Driesprong, Fokko <fo...@driesprong.frl>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Ry,
>>>>>>>
>>>>>>> You should ask Bolke de Bruin. He's really experienced with Kerberos and
>>>>>>> he also did the implementation for Airflow. Besides that, he also worked on
>>>>>>> implementing Kerberos in Ambari. Just want to let you know.
>>>>>>>
>>>>>>> Cheers, Fokko
>>>>>>>
>>>>>>> On Thu, 26 Jul 2018 at 23:03, Ry Walker <r...@astronomer.io> wrote:
>>>>>>>
>>>>>>>> Hi everyone -
>>>>>>>>
>>>>>>>> We have several bigCo's who are considering using Airflow asking about
>>>>>>>> its support for Kerberos.
>>>>>>>>
>>>>>>>> We're going to work on a proof-of-concept next week, and will likely
>>>>>>>> record a screencast on it.
>>>>>>>>
>>>>>>>> For now, we're looking for any anecdotal information from organizations
>>>>>>>> who are using Kerberos with Airflow. If anyone would be willing to share
>>>>>>>> their experiences here, or reply to me personally, it would be greatly
>>>>>>>> appreciated!
>>>>>>>>
>>>>>>>> -Ry
>>>>>>>>
>>>>>>>> --
>>>>>>>>
>>>>>>>> *Ry Walker* | CEO, Astronomer <http://www.astronomer.io/> |
>>>>>>>> 513.417.2163 | @rywalker <http://twitter.com/rywalker> | LinkedIn
>>>>>>>> <http://www.linkedin.com/in/rywalker>
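
For illustration only, here is a minimal sketch of the flow proposed at the top of this thread: the scheduler packs the connections a task may use (with any keytab base64 encoded), the worker unpacks them, and a base hook looks them up by conn_id instead of reading the connections table. Everything named here (`SuppliedConnection`, `SketchBaseHook`, `serialize_for_executor`, `deserialize_on_worker`) is hypothetical and not part of Airflow. JSON stands in for the protobuf message to keep the example standard-library only, and the hard part discussed above (keeping the data secure on the worker, e.g. via the keyring) is not attempted.

```python
import base64
import json
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class SuppliedConnection:
    """Hypothetical container for the connection info one task may see."""
    conn_id: str
    host: str
    login: Optional[str] = None
    password: Optional[str] = None
    keytab: Optional[bytes] = None  # raw keytab bytes, if the task needs one


def serialize_for_executor(connections: Iterable[SuppliedConnection]) -> str:
    """Scheduler side: pack the connections into one message.

    Assumption: JSON is used here instead of protobuf to stay stdlib-only.
    """
    payload = []
    for conn in connections:
        keytab_b64 = None
        if conn.keytab is not None:
            # Keytabs are binary, so they travel base64 encoded.
            keytab_b64 = base64.b64encode(conn.keytab).decode("ascii")
        payload.append({
            "conn_id": conn.conn_id,
            "host": conn.host,
            "login": conn.login,
            "password": conn.password,
            "keytab_b64": keytab_b64,
        })
    return json.dumps(payload)


def deserialize_on_worker(message: str) -> Dict[str, SuppliedConnection]:
    """Worker side: rebuild a conn_id -> connection mapping from the message."""
    supplied = {}
    for item in json.loads(message):
        keytab = None
        if item["keytab_b64"] is not None:
            keytab = base64.b64decode(item["keytab_b64"])
        supplied[item["conn_id"]] = SuppliedConnection(
            conn_id=item["conn_id"],
            host=item["host"],
            login=item["login"],
            password=item["password"],
            keytab=keytab,
        )
    return supplied


class SketchBaseHook:
    """Hypothetical base hook; this is NOT Airflow's BaseHook."""

    _supplied: Dict[str, SuppliedConnection] = {}

    @classmethod
    def set_supplied_connections(cls, supplied: Dict[str, SuppliedConnection]) -> None:
        # Called by the worker before the task's (untrusted) code starts.
        cls._supplied = dict(supplied)

    @classmethod
    def get_connection(cls, conn_id: str) -> SuppliedConnection:
        # Only supplied connections are visible; there is no database fallback.
        try:
            return cls._supplied[conn_id]
        except KeyError:
            raise KeyError(
                f"connection {conn_id!r} was not supplied to this task"
            ) from None
```

A well-behaved hook that subclasses such a base class and only ever calls `get_connection(conn_id)` would not need to change; a hook that queries the connections table directly would.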