Here: https://github.com/bolkedebruin/airflow/tree/secure_connections
Is a working rudimentary implementation that allows securing the connections (only LocalExecutor at the moment):

* It enforces the use of "conn_id" instead of the mix that we have now
* A task using "conn_id" has 'auto-registered' its connections (which is a noop)
* The scheduler reads the connection information and serializes it to JSON (which should be a different format, protobuf preferably)
* The scheduler then sends this info to the executor
* The executor puts this in the environment of the task (the environment is most likely not secure enough for us)
* The BaseHook reads out this environment variable and does not need to touch the database
* The BaseHook is adjusted to not connect to the database

The example_http_operator works, I haven't tested any others. To make it work I just adjusted the hook and operator to use "conn_id" instead of the non-standard http_conn_id.

Makes sense?

B.

> On 28 Jul 2018, at 17:50, Bolke de Bruin <bdbr...@gmail.com> wrote:
>
> Well, I don't think a hook (or task) should obtain it by itself. It should be supplied.
> At the moment you start executing the task you cannot trust it anymore (i.e. it is unmanaged / non-Airflow code).
>
> So we could change the BaseHook to understand supplied credentials and populate a hash keyed by "conn_id". Hooks normally call BaseHook.get_connection anyway, so it shouldn't be too hard and should in principle not require changes to the hooks themselves if they are well behaved.
>
> B.
>
>> On 28 Jul 2018, at 17:41, Dan Davydov <ddavy...@twitter.com.INVALID> wrote:
>>
>> *So basically in the scheduler we parse the dag. Either from the manifest (new) or from smart parsing (probably harder, maybe some auto register?) we know what connections and keytabs are available dag wide or per task.*
>> This is the hard part that I was curious about: for dynamically created DAGs, e.g. those generated by reading tasks from a MySQL database or a JSON file, there isn't a great way to do this.
>>
>> I 100% agree with deprecating the connections table (at least for the secure option). The main work there is rewriting all hooks to take credentials from arbitrary data sources by allowing a customized CredentialsReader class. Although hooks are technically private, I think a lot of companies depend on them, so the PMC should probably discuss whether this is an Airflow 2.0 change or not.
>>
>> On Fri, Jul 27, 2018 at 5:24 PM Bolke de Bruin <bdbr...@gmail.com> wrote:
>>
>>> Sure. In general I consider keytabs part of the connection information. Connections should be secured by sending the connection information a task needs as part of the information the executor gets. A task should then not need access to the connection table in Airflow. Keytabs could then be sent as part of the connection information (base64 encoded) and set up by the executor to be readable only by the task it is launching.
>>>
>>> So basically in the scheduler we parse the dag. Either from the manifest (new) or from smart parsing (probably harder, maybe some auto register?) we know what connections and keytabs are available dag wide or per task.
>>>
>>> The credentials and connection information are then serialized into a protobuf message and sent to the executor as part of the "queue" action. The worker then deserializes the information and makes it securely available to the task (which is quite hard btw).
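A rough sketch of the hand-off described above, assuming the executor passes each connection to the task as a JSON blob in an environment variable named AIRFLOW_CONN_<CONN_ID>; the variable naming scheme and helper functions are illustrative only, not what the branch actually does:

    import json
    import os

    # Executor/worker side (sketch): serialize the connection a task declared via
    # "conn_id" and inject it into the task's environment before launching it.
    def export_connection(env, conn_id, conn_info):
        # conn_info is a plain dict, e.g. {"conn_type": "http", "host": "example.com", ...}
        env["AIRFLOW_CONN_" + conn_id.upper()] = json.dumps(conn_info)

    # Hook side (sketch): get_connection first checks the supplied environment
    # and never touches the metadata database.
    def get_connection(conn_id):
        supplied = os.environ.get("AIRFLOW_CONN_" + conn_id.upper())
        if supplied is None:
            raise KeyError("connection %r was not supplied to this task" % conn_id)
        return json.loads(supplied)

As noted above, a plain environment variable is probably not secure enough and protobuf rather than JSON is the intended wire format; the sketch only shows the hand-off itself.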
>>>
>>> On that last bit, making the info securely available might be storing it in the Linux KEYRING (supported by python keyring). Keytabs will be tough to do properly because Java does not properly support the KEYRING, only files, and files are hard to make secure (due to the possibility that a process will list all files in /tmp and get credentials through that). Maybe storing the keytab with a password and having the password in the KEYRING might work. Something to find out.
>>>
>>> B.
>>>
>>> Sent from my iPad
>>>
>>>> On 27 Jul 2018, at 22:04, Dan Davydov <ddavy...@twitter.com.INVALID> wrote:
>>>>
>>>> I'm curious if you had any ideas on how to enable multi-tenancy with respect to Kerberos in Airflow.
>>>>
>>>>> On Fri, Jul 27, 2018 at 2:38 PM Bolke de Bruin <bdbr...@gmail.com> wrote:
>>>>>
>>>>> Cool. The doc will need some refinement as it isn't entirely accurate. In addition we need to separate between Airflow as a client of kerberized services (this is what the Astronomer doc talks about) vs kerberizing Airflow itself, which the API supports.
>>>>>
>>>>> In general, to access kerberized services (Airflow as a client) one needs to start the ticket renewer with a valid keytab. For the hooks it isn't always required to change the hook to support it: Hadoop CLI tools often just pick it up, as their client config is set to do so. Then there is another class of HTTP-like services accessed by urllib under the hood; these typically use SPNEGO and often need to be adjusted, as that requires some urllib config. Finally, there are protocols which use SASL with Kerberos, like HDFS (not WebHDFS, that uses SPNEGO). These require per-protocol implementations.
>>>>>
>>>>> From the top of my head we support Kerberos client side now with:
>>>>>
>>>>> * Spark
>>>>> * HDFS (snakebite on Python 2.7, the CLI, and the upcoming libhdfs implementation)
>>>>> * Hive (not the metastore afaik)
>>>>>
>>>>> A few things to remember:
>>>>>
>>>>> * If a job (e.g. a Spark job) will finish later than the maximum ticket lifetime, you probably need to provide a keytab to said application. Otherwise you will get failures after the expiry.
>>>>> * A keytab (used by the renewer) is credentials (user and pass), so jobs are executed under the keytab in use at that moment.
>>>>> * Securing keytabs in a multi-tenant Airflow is a challenge. This also goes for securing connections. We need to fix this at some point; the solution for now seems to be no multi-tenancy.
>>>>>
>>>>> Kerberos seems harder than it is btw. Still, we are sometimes moving away from it to OAUTH2-based authentication. This gets us closer to cloud standards (but we are on prem).
>>>>>
>>>>> B.
>>>>>
>>>>> Sent from my iPhone
>>>>>
>>>>>> On 27 Jul 2018, at 17:41, Hitesh Shah <hit...@apache.org> wrote:
>>>>>>
>>>>>> Hi Taylor
>>>>>>
>>>>>> +1 on upstreaming this. It would be great if you can submit a pull request to enhance the Apache Airflow docs.
>>>>>>
>>>>>> thanks
>>>>>> Hitesh
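On the KEYRING idea quoted further up, a minimal sketch of stashing a secret (for instance the password protecting a keytab) with the python keyring library; the service name and key are made up, and a usable keyring backend on the worker is assumed:

    import keyring  # pip install keyring; uses whatever backend the worker provides

    SERVICE = "airflow-task-credentials"  # illustrative namespace, not a real Airflow name

    # Worker side: store the secret under a key scoped to the task instance.
    keyring.set_password(SERVICE, "example_dag.example_task.keytab_password", "s3cr3t")

    # Task side: read it back when the hook needs it, instead of picking up a
    # world-listable file from /tmp.
    password = keyring.get_password(SERVICE, "example_dag.example_task.keytab_password")

Whether this is strong enough against other local processes, and how to make Java-based clients use it, is exactly the "something to find out" part mentioned above.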
>>>>>>> On Thu, Jul 26, 2018 at 2:32 PM Taylor Edmiston <tedmis...@gmail.com> wrote:
>>>>>>>
>>>>>>> While we're on the topic, I'd love any feedback from Bolke or others who've used Kerberos with Airflow on this quick guide I put together yesterday. It's similar to what's in the Airflow docs, but all on one page and slightly expanded.
>>>>>>>
>>>>>>> https://github.com/astronomerio/airflow-guides/blob/master/guides/kerberos.md
>>>>>>> (or web version <https://www.astronomer.io/guides/kerberos/>)
>>>>>>>
>>>>>>> One thing I'd like to add is a minimal example of how to Kerberize a hook.
>>>>>>>
>>>>>>> I'd be happy to upstream this as well if it's useful (maybe a Concepts > Additional Functionality > Kerberos page?)
>>>>>>>
>>>>>>> Best,
>>>>>>> Taylor
>>>>>>>
>>>>>>> *Taylor Edmiston*
>>>>>>> Blog <https://blog.tedmiston.com/> | CV <https://stackoverflow.com/cv/taylor> | LinkedIn <https://www.linkedin.com/in/tedmiston/> | AngelList <https://angel.co/taylor> | Stack Overflow <https://stackoverflow.com/users/149428/taylor-edmiston>
>>>>>>>
>>>>>>> On Thu, Jul 26, 2018 at 5:18 PM, Driesprong, Fokko <fo...@driesprong.frl> wrote:
>>>>>>>
>>>>>>>> Hi Ry,
>>>>>>>>
>>>>>>>> You should ask Bolke de Bruin. He's really experienced with Kerberos and he also did the implementation for Airflow. Besides that, he also worked on implementing Kerberos in Ambari. Just wanted to let you know.
>>>>>>>>
>>>>>>>> Cheers, Fokko
>>>>>>>>
>>>>>>>> On Thu, 26 Jul 2018 at 23:03, Ry Walker <r...@astronomer.io> wrote:
>>>>>>>>
>>>>>>>>> Hi everyone -
>>>>>>>>>
>>>>>>>>> We have several bigCo's who are considering using Airflow asking about its support for Kerberos.
>>>>>>>>>
>>>>>>>>> We're going to work on a proof-of-concept next week, and will likely record a screencast on it.
>>>>>>>>>
>>>>>>>>> For now, we're looking for any anecdotal information from organizations who are using Kerberos with Airflow. If anyone would be willing to share their experiences here, or reply to me personally, it would be greatly appreciated!
>>>>>>>>>
>>>>>>>>> -Ry
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>>
>>>>>>>>> *Ry Walker* | CEO, Astronomer <http://www.astronomer.io/> | 513.417.2163 | @rywalker <http://twitter.com/rywalker> | LinkedIn <http://www.linkedin.com/in/rywalker>
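On Taylor's wish for a minimal example of how to Kerberize a hook: for the HTTP/SPNEGO class of services Bolke mentions, a hedged sketch using requests with the requests_kerberos package. The function and service URL are made up, and it assumes a valid ticket cache, e.g. kept fresh by the `airflow kerberos` renewer started with a keytab:

    import requests
    from requests_kerberos import HTTPKerberosAuth, OPTIONAL  # pip install requests-kerberos

    def run(endpoint, base_url="https://service.example.com"):
        # SPNEGO negotiation uses the Kerberos ticket cache; no password or
        # connection-table lookup is needed inside the task.
        auth = HTTPKerberosAuth(mutual_authentication=OPTIONAL)
        response = requests.get(base_url + endpoint, auth=auth)
        response.raise_for_status()
        return response.json()

SASL-based protocols such as HDFS do not fit this pattern; as noted earlier in the thread, those need per-protocol implementations.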