I had a lot of interesting discussions over the last few days with Apache
Airflow users at PyData Warsaw 2019 (I was actually quite surprised how
many people use Airflow in Poland). One discussion brought up an
interesting subject: packaging DAGs in the wheel format. The users
mentioned that they are super-happy using .zip-packaged DAGs, but think
this could be improved with the wheel format (which is also a .zip under
the hood, BTW). Maybe this was already mentioned in some discussion
before, but I have not found any.

*Context:*

We are well on the way to implementing "AIP-21 Changing import paths" and
will provide backport packages for Airflow 1.10. As a next step we want to
target AIP-8.
One of the problems with implementing AIP-8 (splitting hooks/operators
into separate packages) is dependencies. Different operators/hooks might
have different dependencies if maintained separately. Currently we have a
common set of dependencies because we have only one setup.py, but if we
split into separate packages, this might change.

*Proposal:*

Our users - who love the .zip DAG distribution - proposed that we package
the DAGs and all related packages in a wheel package instead of a pure
.zip. This would allow users to install the extra dependencies needed by
the DAG. It struck me that we could indeed do that for DAGs, but also use
it to mitigate most of the dependency problems for separately-packaged
operators.

The proposal from our users was to package the extra dependencies together
with the DAG in a wheel file. This is quite cool on its own, but I thought
we might actually use the same approach to solve the dependency problem of
AIP-8.
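To illustrate what such a DAG wheel could look like, here is a minimal setup.py sketch (the package name, DAG module, and dependency list are all made up for illustration) that declares the DAG file and its extra dependencies, so pip would install both together:

```python
# Hypothetical setup.py for a DAG packaged as a wheel.
# Name, module, and dependencies below are illustrative only.
from setuptools import setup

setup(
    name="my-sales-report-dag",
    version="1.0.0",
    py_modules=["sales_report_dag"],  # the file defining the DAG
    install_requires=[
        "pandas>=0.25",               # extra deps this DAG needs
        "requests",
    ],
)
```

Building it with "python setup.py bdist_wheel" would produce a single .whl carrying both the DAG code and its dependency metadata.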

I think we could implement an "operator group" -> extra -> "pip packages"
dependency mapping (we need it anyway for AIP-21), and then we could have
wheel packages with all the "extra" dependencies for each group of
operators.
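The mapping itself could be as simple as a dictionary from operator group to the pip packages its "extra" pulls in (the group names and package lists below are just examples, not the real extras):

```python
# Hypothetical "operator group" -> pip packages mapping; the real
# extras and package lists would come out of the AIP-21 work.
EXTRA_REQUIREMENTS = {
    "google": ["google-cloud-storage", "google-cloud-bigquery"],
    "amazon": ["boto3"],
    "postgres": ["psycopg2-binary"],
}

def packages_for_group(group):
    """Return the pip packages needed to run operators from a group."""
    return EXTRA_REQUIREMENTS.get(group, [])
```

Each group's list would then feed the build of one "extras" wheel package.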

A worker executing an operator would have the "core" dependencies
installed initially, but when it is supposed to run an operator it could
create a virtualenv, install the required "extra" from wheels, run the
task for this operator in that virtualenv, and then remove the virtualenv.
We could have such wheel packages prepared (one wheel package per operator
group) and distributed either the same way as DAGs or via some shared
binary repository (and cached on the worker).
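A rough sketch of that worker-side lifecycle (the cache path and wheel name are made up; the real wiring would of course live inside the executor):

```shell
set -e
# Create a throwaway virtualenv for the task.
VENV_DIR="$(mktemp -d)/task-venv"
python3 -m venv "$VENV_DIR"
# Install the operator group's wheel from a local cache
# (path and wheel name are illustrative; --no-index keeps
# the worker from reaching out to PyPI):
#   "$VENV_DIR/bin/pip" install --no-index \
#       --find-links /var/cache/airflow-wheels airflow-google-extras
# Run the task inside the isolated environment:
#   "$VENV_DIR/bin/python" -m airflow run <dag_id> <task_id> <date>
# Then throw the environment away.
rm -rf "$VENV_DIR"
```

The virtualenv creation itself is cheap; the wheel cache is what keeps the per-task install step fast.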

Such a dynamically created virtualenv also has the advantage that if
someone has a DAG with specific dependencies, those can be embedded in
the DAG wheel, installed from it into the virtualenv, and the virtualenv
would be removed after the task finishes.

The advantage of this approach is that each DAG's extra dependencies are
isolated, and you could even have different versions of the same
dependency used by different DAGs. I think that could save a lot of
headaches for many users.

For me that whole idea sounds pretty cool.

Let me know what you think.

J.


-- 

Jarek Potiuk
Polidea <https://www.polidea.com/> | Principal Software Engineer

M: +48 660 796 129 <+48660796129>
