I think it could be cool to add DAG versioning, so that it is possible to fetch a particular version of a DAG. What do you think about it?

Claudio
-------- Original message --------
From: Chao-Han Tsai <[email protected]>
Date: 22/12/19 22:35 (GMT+01:00)
To: [email protected]
Cc: Maxime Beauchemin <[email protected]>
Subject: Re: [DISCUSS] Packaging DAG/operator dependencies in wheels

Probably it is a good time to revisit
https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-5+Remote+DAG+Fetcher
again?
On Sun, Dec 22, 2019 at 12:16 PM Jarek Potiuk <[email protected]> wrote:

I also love the idea of a DAG fetcher. It fits the "Python-centric" rather than "Container-centric" approach very well. Fetching it from different sources like local / .zip and then .wheel seems like an interesting approach. I think the important parts of whatever approach we come up with are:

- make it easy for development/iteration by the creator
- make it stable/manageable for deployment purposes
- make it manageable for incremental updates.

J.

On Sun, Dec 22, 2019 at 4:35 PM Tomasz Urbaszek <[email protected]> wrote:

I like the idea of a DagFetcher (https://github.com/apache/airflow/pull/3138). I think it's a good and simple starting point to fetch .py files from places like the local file system, S3 or GCS (that's what Composer actually does under the hood). As a next step we can think about wheels, zip and other more demanding packaging.

In my opinion, in the case of such "big" changes we should try to iterate in small steps, especially if we don't have any strong opinions.

Bests,
Tomek
On Sat, Dec 21, 2019 at 1:23 PM Jarek Potiuk <[email protected]> wrote:

I am in "before-Xmas" mood, so I thought I would write more of my thoughts about it :).

*TL;DR: I try to reason (mostly looking at it from the philosophy/usage point of view) why a container-native approach might not be best for Airflow and why we should go Python-first instead.*

I also used to be in the "Docker" camp, as it seemed kind of natural. Adding a DAG layer at package runtime seems like a natural thing to do. That seems to fit perfectly well some sophisticated production deployment models where people are using a Docker registry to deploy new software.

But in the meantime many more questions started to bother me:

- Is it really the case for all the deployment models and use cases of how Airflow is used?
- While it is a good model for some frozen-in-time production deployment model, is it a good model to support the whole DAG lifecycle? Think about initial development, debugging and iteration, but also post-deployment maintenance and upgrades.
- More importantly, does it fit the current philosophy of Airflow, and is it expected by its users?

After asking those questions (and formulating some answers) I am not so sure any more that containerisation should be something Airflow bases its deployment model on.

After spending a year with Airflow, getting more embedded in its philosophy, talking to the users and especially looking at the "competition" we have, I changed my mind here. I don't think Airflow is in the "Container-centric" world - it is really a "Python-centric" world, and that is a conscious choice we should continue with in the future.

I think there are a number of advantages of Airflow that make it so popular and really liked by its users. If we go a bit too far into the "Docker/Container/Cloud Native" world, we might get a bit closer to some of our competitors (think Argo, for example), but we might lose quite a bit of the advantage we have - the exact advantage that makes us better for our users, different from the competition, and serving quite different use cases than a "general workflow engine".

While I am not a data scientist myself, I have interacted with data scientists and data engineers a lot (mostly while working as a robotics engineer at NoMagic.ai), and I found that they think and act quite differently from DevOps or even traditional software engineers. And I think those people are our primary users. Looking at the results of our recent survey <https://airflow.apache.org/blog/airflow-survey/>, around 70% of Airflow users call themselves "Data Engineer" or "Data Scientist".

Let me dive a bit deeper.

For me, when I think "Airflow" I immediately think "Python". There are certain advantages to Airflow being Python-first and Python-focused. The main advantage is that the same people who are able to do data science feel comfortable writing the pipelines and using pre-existing abstractions that make it easier for them to write those pipelines (DAGs/Operators/Sensors/...). Those are mainly data scientists who live and breathe Python as their primary tool of choice. Using Jupyter Notebooks and writing data processing and machine learning experiments as Python scripts is part of their daily job. Docker and containers are merely an execution engine for whatever they do, and while they know about them and realise why containers are useful, it's best if they do not have to bother with containerisation. Even if they use it, it should be pretty much transparent to them. This is in part the reasoning behind developing Breeze - while it uses containers to take advantage of isolation and a consistent environment for everyone, it tries to hide the dockerization/containerisation as much as possible and provide a simple, focused interface to manage it. People who know Python don't necessarily need to understand containerisation in order to make use of its advantages. It's very similar to virtual machines, compilers, etc. - we make use of them without really knowing how they work. And that's perfectly OK - they don't have to.

Tying the deployment of Airflow DAGs to Docker images has the disadvantage that you have to include the whole step of packaging, distributing, sharing and using the image by the Airflow "worker". It also basically means that every task execution in Airflow has to be a separate Docker container - isolated from the rest, started pretty much totally from scratch - either as part of a new Pod in Kubernetes or spun off as a new container via docker-compose or docker-swarm. The whole idea of having separate DAGs which can be updated independently and potentially have different dependencies, maybe other Python code, etc. - this pretty much means that for every single DAG you want to update, you need to package it as an extra layer in Docker, put it somewhere in a shared registry, switch your executors to use the new image, get it downloaded by the executor, and restart the worker somehow (to start a container based on that new image). That's a lot of hassle just to update one line in a DAG. Surely we can automate that and make it fast, but it's quite difficult to explain to data scientists who just want to change one line in a DAG that they have to go through that process. They would need to understand how to check whether their image is properly built and distributed, whether the executor has already picked up the new image, whether the worker has already picked up the new image - and in the case of a spelling mistake they would have to repeat the whole process again. That's hardly what data scientists are used to. They are used to trying something and seeing results as quickly as possible, without too much hassle and without knowing about external tooling. This is the whole point of Jupyter notebooks, for example - you can incrementally change a single step in your whole process and continue iterating on the rest. This is one of the reasons we immediately loved the idea from Databand.ai to develop the DebugExecutor <https://github.com/apache/airflow/blob/master/TESTING.rst#dag-testing> and we helped make it merge-ready. It lets data scientists iterate on and debug their DAGs using their familiar tools and process (just as if they were debugging a Python script) without the hassle of learning new tools and changing the way they work. Tomek will soon write a blog post about it, but I think it's one of the best productivity improvements we could give our DAG-writing users in a long time.
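To make this concrete: a DAG file set up for DebugExecutor-style debugging (following the TESTING.rst page linked above) could look roughly like the sketch below. The DAG itself is made up and the exact calls may differ between Airflow versions, so treat this as an illustration rather than the canonical recipe.

    # Sketch of the DebugExecutor debugging flow (see TESTING.rst#dag-testing).
    # The DAG and task are invented; exact APIs may vary by Airflow version.
    #
    #   export AIRFLOW__CORE__EXECUTOR=DebugExecutor
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator


    def extract():
        print("calling an external API and waiting for the result...")


    with DAG(dag_id="debug_me", start_date=datetime(2019, 12, 1),
             schedule_interval=None) as dag:
        PythonOperator(task_id="extract", python_callable=extract)

    if __name__ == "__main__":
        # Running the file directly executes all tasks in a single process,
        # so you can set breakpoints and step through the DAG like any script.
        dag.clear()
        dag.run()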
This problem is also quite visible with container-native workflow engines such as Argo, which force every single step of your workflow to be a Docker container. That sounds great in theory (containers! isolation! kubernetes!), and it even works perfectly well in a number of practical cases - for example when each step requires complex processing, a number of dependencies and different binaries. But when you look at it more closely, this is NOT the primary use case for Airflow. The primary use case of Airflow is talking to other systems via APIs and orchestrating their work. There is hardly any processing on Airflow worker nodes. There are hardly any new requirements/dependencies needed in most cases. I really love that Airflow actually focuses on the "glue" layer between those external services. Again - the same people who do data engineering can interact over a Python API with the services they use, put all the steps and logic as Python code in the same DAG, iterate and change it and get immediate feedback - and even add a few lines of code if they need an extra parameter or so. Imagine the case where every step of your workflow is a Docker container to run: as a data engineer you have to use Python to put the DAG together, then if you want to interact with an external service you have to find an existing container that does it, figure out how to pass credentials to that container from your host (this is often non-trivial), and in many cases you find that in order to achieve what you want you have to build your own image, because those available in public registries are old or don't have some features exposed. It happened to me many times when I tried to use such workflows - I was eventually forced to build and deploy my own Docker image somewhere, even if I was just iterating and trying different things. That's far more complex than 'pip install <x>', adding '<x>' to setup.py and adding one or two lines of Python code to do what I want. And I am super-familiar with Docker. I live and breathe Docker. But I can see how intimidating and difficult it must be for people who don't.

That's why I think that our basic and most common deployment model (even the one used in production) should be based on the Python toolset, not containers. Wheels seem like a great tool for Python dependency management. I think in most cases, when we have just a few dependencies to install per task (for example the Python Google libraries for Google tasks), installing them from a wheel in a running container and creating a virtualenv for it might be comparable to or even faster than restarting a whole new container with those packages installed as a layer - not to mention the much smaller memory and CPU overhead if this is done within a running container rather than restarting the whole container for that task. Kubernetes and its deployment models are very well suited for long-running tasks that do a lot of work, but if you want to start a new container that starts a whole Python interpreter with all dependencies, with its own CPU/memory requirements, *JUST* to make an API call to an external service and wait for it to finish (most Airflow tasks are exactly this), this seems like terrible overkill. It seems that the Native Executor <https://github.com/apache/airflow/pull/6750> idea discussed in the sig-scalability group - where we abstract away from the deployment model, use queues to communicate, and keep the worker running to serve many subsequent tasks - is a much better idea than dedicated executors such as the KubernetesExecutor, which starts a new POD for every task. We should still use containers under the hood of course, and have deployments using Kubernetes etc., but this should be transparent to the people who write DAGs.
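To make the "install from a wheel into a virtualenv inside a running worker" idea more concrete, here is a minimal sketch using only the standard library and pip. The wheel filename and the task command are hypothetical placeholders; this is not an existing Airflow API, just an illustration of the flow (create venv, install wheel, run task, clean up).

    # Minimal sketch: per-task virtualenv inside an already-running worker.
    # The wheel path and task command below are placeholders.
    import shutil
    import subprocess
    import tempfile
    import venv


    def run_task_with_wheel(wheel_path, task_cmd):
        env_dir = tempfile.mkdtemp(prefix="airflow-task-venv-")
        try:
            # Create an isolated environment with pip available (POSIX layout).
            venv.EnvBuilder(with_pip=True).create(env_dir)
            py = env_dir + "/bin/python"

            # Install the DAG / operator-group dependencies shipped as a wheel.
            subprocess.check_call([py, "-m", "pip", "install", "--quiet", wheel_path])

            # Run the task body inside the throwaway environment.
            subprocess.check_call([py] + task_cmd)
        finally:
            # The environment only lives for the duration of the task.
            shutil.rmtree(env_dir, ignore_errors=True)


    if __name__ == "__main__":
        run_task_with_wheel(
            "dist/my_dag_deps-0.1-py3-none-any.whl",   # placeholder wheel
            ["-c", "print('task would run here')"],
        )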
Sorry for such a long mail - I just think this is a super-important decision on the philosophy of Airflow, which use cases it serves and how well it serves the whole lifecycle of DAGs, from debugging to maintenance. And I think it should really be a foundation of how we implement some of the deployment-related features of Airflow 2.0, in order for it to stay relevant, preferred by our users and focused on the cases it already does very well.

Let me know what you think. But in the meantime - have a great Xmas, everyone!

J.

On Sat, Dec 21, 2019 at 10:42 AM Ash Berlin-Taylor <[email protected]> wrote:

> For the docker example, you'd almost want to inject or "layer" the DAG script and airflow package at run time.

Something sort of like Heroku build packs?

-a
On 20 December 2019 23:43:30 GMT, Maxime Beauchemin <[email protected]> wrote:

This reminds me of the "DagFetcher" idea: basically a new abstraction that can fetch a DAG object from anywhere and run a task. In theory you could extend it to do "zip on S3", "pex on GFS", "docker on Artifactory" or whatever makes sense to your organization. In the proposal I wrote about using a universal URI scheme to identify DAG artifacts, with support for versioning, as in s3://company_dagbag/some_dag@latest

One challenge is around *not* serializing Airflow-specific code in the artifact/docker, otherwise you end up with a messy heterogeneous cluster that runs multiple Airflow versions. For the docker example, you'd almost want to inject or "layer" the DAG script and airflow package at run time.

Max
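A rough sketch of what a DagFetcher abstraction with such a versioned URI scheme could look like is below. The class names, scheme registry and S3 backend are hypothetical - nothing like this exists in Airflow today - and the actual download call is omitted.

    # Hypothetical DagFetcher shape for versioned URIs such as
    # s3://company_dagbag/some_dag@latest. Purely illustrative.
    from abc import ABC, abstractmethod
    from urllib.parse import urlparse


    class DagFetcher(ABC):
        """Fetches the DAG artifact behind a URI and returns a local path."""

        @abstractmethod
        def fetch(self, uri):
            ...


    class S3DagFetcher(DagFetcher):
        def fetch(self, uri):
            parsed = urlparse(uri)                     # bucket + "dag_id@version"
            dag_id, _, version = parsed.path.lstrip("/").partition("@")
            version = version or "latest"
            # Download <bucket>/<dag_id> at <version> into a local cache and
            # return the path (the boto3 call is omitted in this sketch).
            return "/tmp/dag_cache/%s/%s/%s" % (parsed.netloc, dag_id, version)


    FETCHERS = {"s3": S3DagFetcher()}                  # registry keyed by scheme


    def fetch_dag(uri):
        return FETCHERS[urlparse(uri).scheme].fetch(uri)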
On Mon, Dec 16, 2019 at 7:17 AM Dan Davydov <[email protected]> wrote:

The zip support is a bit of a hack and was a bit controversial when it was added. I think if we go down the path of supporting more DAG sources, we should make sure we have the right interface in place, so we avoid the current `if format == zip then: else:` and make sure that we don't tightly couple to specific DAG-sourcing implementations. Personally I feel that Docker makes more sense than wheels (since images are fully self-contained, even at the binary dependency level), but if we go down the interface route it might be fine to add support for both Docker and wheels.

On Mon, Dec 16, 2019 at 11:19 AM Björn Pollex <[email protected]> wrote:

Hi Jarek,

This sounds great. Is this possibly related to the work started in https://github.com/apache/airflow/pull/730?

I'm not sure I'm following your proposal entirely. Initially, a great first step would be to support loading DAGs from an entry_point, as proposed in the closed PR above. This would already enable most of the features you've mentioned below. Each DAG could be a Python package, and it would carry all the information about required packages in its package meta-data.

Is that what you're envisioning? If so, I'd be happy to support you with the implementation!

Also, while I think the idea of creating a temporary virtual environment for running tasks is very useful, I'd like this to be optional, as it can also create a lot of overhead for running tasks.

Cheers,
Björn
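For illustration, a DAG distributed as a Python package that advertises its DAG object through an entry point might have a setup.py roughly like the one below. The entry-point group name "airflow.dags" and the package layout are invented for this sketch - the closed PR above may use different conventions - but the mechanism (package metadata carrying both the dependencies and a pointer to the DAG) is the same.

    # Hypothetical setup.py for a DAG shipped as a package. The entry-point
    # group name "airflow.dags" is invented for this sketch.
    from setuptools import setup, find_packages

    setup(
        name="my-dag-package",
        version="0.1.0",
        packages=find_packages(),
        install_requires=["requests>=2.22"],       # the DAG's own dependencies
        entry_points={
            "airflow.dags": [
                # module path : attribute holding the DAG object
                "my_dag = my_dag_package.dag:dag",
            ],
        },
    )

A scheduler or DagBag could then discover such DAGs with something like pkg_resources.iter_entry_points("airflow.dags") and load each object via ep.load(), instead of scanning a DAGs folder.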
On 14 Dec 2019, at 11:10, Jarek Potiuk <[email protected]> wrote:

I had a lot of interesting discussions over the last few days with Apache Airflow users at PyDataWarsaw 2019 (I was actually quite surprised how many people use Airflow in Poland). One discussion brought up an interesting subject: packaging DAGs in wheel format. The users mentioned that they are super-happy using .zip-packaged DAGs, but they think it could be improved with the wheel format (which is also .zip, BTW). Maybe it was already mentioned in some discussions before, but I have not found any.

*Context:*

We are well on the way to implementing "AIP-21 Changing import paths" and will provide backport packages for Airflow 1.10. As a next step we want to target AIP-8. One of the problems with implementing AIP-8 (splitting hooks/operators into separate packages) is the problem of dependencies. Different operators/hooks might have different dependencies if maintained separately. Currently we have a common set of dependencies as we have only one setup.py, but if we split into separate packages, this might change.

*Proposal:*

Our users - who love the .zip DAG distribution - proposed that we package the DAGs and all related packages in a wheel package instead of a pure .zip. This would allow the users to install extra dependencies needed by the DAG. And it struck me that we could indeed do that for DAGs, but also mitigate most of the dependency problems for separately-packaged operators.

The proposal from our users was to package the extra dependencies together with the DAG in a wheel file. This is quite cool on its own, but I thought we might actually use the same approach to solve the dependency problem with AIP-8.

I think we could implement "operator group" -> extra -> "pip packages" dependencies (we need them anyway for AIP-21) and then we could have wheel packages with all the "extra" dependencies for each group of operators.
A worker executing an operator could have the "core" dependencies installed initially, but then, when it is supposed to run an operator, it could create a virtualenv, install the required "extra" from wheels and run the task for this operator in that virtualenv (and then remove the virtualenv). We could have such package wheels prepared (one wheel package per operator group) and distributed either the same way as DAGs or via some shared binary repository (and cached in the worker).

Having such a dynamically created virtualenv also has the advantage that if someone has a DAG with specific dependencies, those could be embedded in the DAG wheel, installed from it into this virtualenv, and the virtualenv would be removed after the task is finished.

The advantage of this approach is that each DAG's extra dependencies are isolated, and you could even have different versions of the same dependency used by different DAGs. I think that could save a lot of headaches for many users.

For me that whole idea sounds pretty cool.

Let me know what you think.

J.

--
Jarek Potiuk
Polidea <https://www.polidea.com/> | Principal Software Engineer
M: +48 660 796 129

--
Chao-Han Tsai