I think it could be cool to add DAG versioning, so that it is possible to fetch a particular version of a DAG. What do you think about it?

Claudio
-------- Original message --------
From: Chao-Han Tsai <[email protected]>
Date: 22/12/19 22:35 (GMT+01:00)
To: [email protected]
Cc: Maxime Beauchemin <[email protected]>
Subject: Re: [DISCUSS] Packaging DAG/operator dependencies in wheels

Probably it is a good time to revisit
https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-5+Remote+DAG+Fetcher
again?
On Sun, Dec 22, 2019 at 12:16 PM Jarek Potiuk <[email protected]> wrote:

I also love the idea of a DAG fetcher. It fits the "Python-centric" rather than "Container-centric" approach very well. Fetching it from different sources like local / .zip and then .wheel seems like an interesting approach. I think the important parts of whatever approach we come up with are:

- make it easy for development/iteration by the creator
- make it stable/manageable for deployment purposes
- make it manageable for incremental updates.

J.

On Sun, Dec 22, 2019 at 4:35 PM Tomasz Urbaszek <[email protected]> wrote:

I like the idea of a DagFetcher (https://github.com/apache/airflow/pull/3138). I think it's a good and simple starting point to fetch .py files from places like the local file system, S3 or GCS (that's what Composer actually does under the hood). As a next step we can think about wheels, zip and other more demanding packaging.

In my opinion, in the case of such "big" changes we should try to iterate in small steps, especially if we don't have any strong opinions.

Bests,
Tomek
On Sat, Dec 21, 2019 at 1:23 PM Jarek Potiuk <[email protected]> wrote:

I am in "before-Xmas" mood, so I thought I would write more of my thoughts about it :).

*TL;DR: I try to reason (mostly looking at it from the philosophy/usage point of view) why a container-native approach might not be best for Airflow and why we should go Python-first instead.*

I also used to be in the "Docker" camp, as it seemed kind of natural. Adding a DAG layer at package runtime seems like a natural thing to do. That seems to fit perfectly well some sophisticated production deployment models where people are using a Docker registry to deploy new software.

But in the meantime many more questions started to bother me:

- Is it really the case for all the deployment models and use cases of how Airflow is used?
- While it is a good model for some frozen-in-time production deployment model, is it a good model to support the whole DAG lifecycle? Think about initial development, debugging and iteration, but also post-deployment maintenance and upgrades.
- More importantly, does it fit the current philosophy of Airflow, and is it expected by its users?

After asking those questions (and formulating some answers) I am not so sure any more that containerisation should be something Airflow bases its deployment model on.

After spending a year with Airflow, getting more embedded in its philosophy, talking to the users and especially looking at the "competition" we have, I changed my mind here. I don't think Airflow is in the "Container-centric" world - it is really a "Python-centric" world, and that is a conscious choice we should continue with in the future.

I think there are a number of advantages of Airflow that make it so popular and really liked by its users. If we go a bit too far into the "Docker/Container/Cloud Native" world, we might get a bit closer to some of our competitors (think Argo, for example), but we might lose quite a bit of the advantage we have - the exact advantage that makes us better for our users, different from the competition, and serving quite different use cases than a "general workflow engine".

While I am not a data scientist myself, I have interacted with data scientists and data engineers a lot (mostly while working as a robotics engineer at NoMagic.ai), and I found that they think and act quite differently from DevOps or even traditional software engineers. And I think those people are our primary users. Looking at the results of our recent survey <https://airflow.apache.org/blog/airflow-survey/>, around 70% of Airflow users call themselves "Data Engineer" or "Data Scientist".

Let me dive a bit deeper.

For me, when I think "Airflow" I immediately think "Python". There are certain advantages to Airflow being Python-first and Python-focused. The main advantage is that the same people who are able to do data science feel comfortable writing the pipelines and using pre-existing abstractions that make it easier for them to write those pipelines (DAGs/Operators/Sensors/...). Those are mainly data scientists who live and breathe Python as their primary tool of choice. Using Jupyter Notebooks and writing data processing and machine learning experiments as Python scripts is part of their daily job. Docker and containers are merely an execution engine for whatever they do, and while they know about them and realise why containers are useful, it's best if they do not have to bother with containerisation. Even if they use it, it should be pretty much transparent to them. This is in part the reasoning behind developing Breeze - while it uses containers to take advantage of isolation and a consistent environment for everyone, it tries to hide the dockerization/containerisation as much as possible and provide a simple, focused interface to manage it. People who know Python don't necessarily need to understand containerisation in order to make use of its advantages. It's very similar to virtual machines, compilers, etc. - we make use of them without really knowing how they work. And that's perfectly OK - they don't have to.

Tying the deployment of Airflow DAGs to Docker images has the disadvantage that you have to include the whole step of packaging, distributing, sharing and using the image by the Airflow "worker". It also basically means that every task execution in Airflow has to be a separate Docker container - isolated from the rest, started pretty much totally from scratch - either as part of a new Pod in Kubernetes or spun off as a new container via docker-compose or docker-swarm. The whole idea of having separate DAGs which can be updated independently and potentially have different dependencies, maybe other Python code, etc. - this pretty much means that for every single DAG you want to update, you need to package it as an extra layer in Docker, put it somewhere in a shared registry, switch your executors to use the new image, get it downloaded by the executor, and restart the worker somehow (to start a container based on that new image). That's a lot of hassle just to update one line in a DAG. Surely we can automate that and make it fast, but it's quite difficult to explain to data scientists who just want to change one line in a DAG that they have to go through that process. They would need to understand how to check whether their image is properly built and distributed, whether the executor has already picked up the new image, whether the worker has already picked up the new image - and in the case of a spelling mistake they would have to repeat the whole process again. That's hardly what data scientists are used to. They are used to trying something and seeing results as quickly as possible, without too much hassle and without knowing about external tooling. This is the whole point of Jupyter notebooks, for example - you can incrementally change a single step in your whole process and continue iterating on the rest. This is one of the reasons we immediately loved the idea from Databand.ai to develop the DebugExecutor <https://github.com/apache/airflow/blob/master/TESTING.rst#dag-testing> and we helped make it merge-ready. It lets data scientists iterate on and debug their DAGs using their familiar tools and process (just as if they were debugging a Python script) without the hassle of learning new tools and changing the way they work. Tomek will soon write a blog post about it, but I think it's one of the best productivity improvements we could give our DAG-writing users in a long time.
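To make this concrete: a DAG file set up for DebugExecutor-style debugging (following the TESTING.rst page linked above) could look roughly like the sketch below. The DAG itself is made up and the exact calls may differ between Airflow versions, so treat this as an illustration rather than the canonical recipe.

    # Sketch of the DebugExecutor debugging flow (see TESTING.rst#dag-testing).
    # The DAG and task are invented; exact APIs may vary by Airflow version.
    #
    #   export AIRFLOW__CORE__EXECUTOR=DebugExecutor
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator


    def extract():
        print("calling an external API and waiting for the result...")


    with DAG(dag_id="debug_me", start_date=datetime(2019, 12, 1),
             schedule_interval=None) as dag:
        PythonOperator(task_id="extract", python_callable=extract)

    if __name__ == "__main__":
        # Running the file directly executes all tasks in a single process,
        # so you can set breakpoints and step through the DAG like any script.
        dag.clear()
        dag.run()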
This problem is also quite visible with container-native workflow engines such as Argo, which force every single step of your workflow to be a Docker container. That sounds great in theory (containers! isolation! kubernetes!), and it even works perfectly well in a number of practical cases - for example when each step requires complex processing, a number of dependencies and different binaries. But when you look at it more closely, this is NOT the primary use case for Airflow. The primary use case of Airflow is talking to other systems via APIs and orchestrating their work. There is hardly any processing on Airflow worker nodes. There are hardly any new requirements/dependencies needed in most cases. I really love that Airflow actually focuses on the "glue" layer between those external services. Again - the same people who do data engineering can interact over a Python API with the services they use, put all the steps and logic as Python code in the same DAG, iterate and change it and get immediate feedback - and even add a few lines of code if they need an extra parameter or so. Imagine the case where every step of your workflow is a Docker container to run: as a data engineer you have to use Python to put the DAG together, then if you want to interact with an external service you have to find an existing container that does it, figure out how to pass credentials to that container from your host (this is often non-trivial), and in many cases you find that in order to achieve what you want you have to build your own image, because those available in public registries are old or don't have some features exposed. It happened to me many times when I tried to use such workflows - I was eventually forced to build and deploy my own Docker image somewhere, even if I was just iterating and trying different things. That's far more complex than 'pip install <x>', adding '<x>' to setup.py and adding one or two lines of Python code to do what I want. And I am super-familiar with Docker. I live and breathe Docker. But I can see how intimidating and difficult it must be for people who don't.

That's why I think that our basic and most common deployment model (even the one used in production) should be based on the Python toolset, not containers. Wheels seem like a great tool for Python dependency management. I think in most cases, when we have just a few dependencies to install per task (for example the Python Google libraries for Google tasks), installing them from a wheel in a running container and creating a virtualenv for it might be comparable to or even faster than restarting a whole new container with those packages installed as a layer - not to mention the much smaller memory and CPU overhead if this is done within a running container rather than restarting the whole container for that task. Kubernetes and its deployment models are very well suited for long-running tasks that do a lot of work, but if you want to start a new container that starts a whole Python interpreter with all dependencies, with its own CPU/memory requirements, *JUST* to make an API call to an external service and wait for it to finish (most Airflow tasks are exactly this), this seems like terrible overkill. It seems that the Native Executor <https://github.com/apache/airflow/pull/6750> idea discussed in the sig-scalability group - where we abstract away from the deployment model, use queues to communicate, and keep the worker running to serve many subsequent tasks - is a much better idea than dedicated executors such as the KubernetesExecutor, which starts a new POD for every task. We should still use containers under the hood of course, and have deployments using Kubernetes etc., but this should be transparent to the people who write DAGs.
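To make the "install from a wheel into a virtualenv inside a running worker" idea more concrete, here is a minimal sketch using only the standard library and pip. The wheel filename and the task command are hypothetical placeholders; this is not an existing Airflow API, just an illustration of the flow (create venv, install wheel, run task, clean up).

    # Minimal sketch: per-task virtualenv inside an already-running worker.
    # The wheel path and task command below are placeholders.
    import shutil
    import subprocess
    import tempfile
    import venv


    def run_task_with_wheel(wheel_path, task_cmd):
        env_dir = tempfile.mkdtemp(prefix="airflow-task-venv-")
        try:
            # Create an isolated environment with pip available (POSIX layout).
            venv.EnvBuilder(with_pip=True).create(env_dir)
            py = env_dir + "/bin/python"

            # Install the DAG / operator-group dependencies shipped as a wheel.
            subprocess.check_call([py, "-m", "pip", "install", "--quiet", wheel_path])

            # Run the task body inside the throwaway environment.
            subprocess.check_call([py] + task_cmd)
        finally:
            # The environment only lives for the duration of the task.
            shutil.rmtree(env_dir, ignore_errors=True)


    if __name__ == "__main__":
        run_task_with_wheel(
            "dist/my_dag_deps-0.1-py3-none-any.whl",   # placeholder wheel
            ["-c", "print('task would run here')"],
        )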
Sorry for such a long mail - I just think this is a super-important decision on the philosophy of Airflow, which use cases it serves and how well it serves the whole lifecycle of DAGs, from debugging to maintenance. And I think it should really be a foundation of how we implement some of the deployment-related features of Airflow 2.0, in order for it to stay relevant, preferred by our users and focused on the cases it already does very well.

Let me know what you think. But in the meantime - have a great Xmas, everyone!

J.

On Sat, Dec 21, 2019 at 10:42 AM Ash Berlin-Taylor <[email protected]> wrote:

> For the docker example, you'd almost want to inject or "layer" the DAG script and airflow package at run time.

Something sort of like Heroku build packs?

-a
On 20 December 2019 23:43:30 GMT, Maxime Beauchemin <[email protected]> wrote:

This reminds me of the "DagFetcher" idea: basically a new abstraction that can fetch a DAG object from anywhere and run a task. In theory you could extend it to do "zip on S3", "pex on GFS", "docker on Artifactory" or whatever makes sense to your organization. In the proposal I wrote about using a universal URI scheme to identify DAG artifacts, with support for versioning, as in s3://company_dagbag/some_dag@latest

One challenge is around *not* serializing Airflow-specific code in the artifact/docker, otherwise you end up with a messy heterogeneous cluster that runs multiple Airflow versions. For the docker example, you'd almost want to inject or "layer" the DAG script and airflow package at run time.

Max
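A rough sketch of what a DagFetcher abstraction with such a versioned URI scheme could look like is below. The class names, scheme registry and S3 backend are hypothetical - nothing like this exists in Airflow today - and the actual download call is omitted.

    # Hypothetical DagFetcher shape for versioned URIs such as
    # s3://company_dagbag/some_dag@latest. Purely illustrative.
    from abc import ABC, abstractmethod
    from urllib.parse import urlparse


    class DagFetcher(ABC):
        """Fetches the DAG artifact behind a URI and returns a local path."""

        @abstractmethod
        def fetch(self, uri):
            ...


    class S3DagFetcher(DagFetcher):
        def fetch(self, uri):
            parsed = urlparse(uri)                     # bucket + "dag_id@version"
            dag_id, _, version = parsed.path.lstrip("/").partition("@")
            version = version or "latest"
            # Download <bucket>/<dag_id> at <version> into a local cache and
            # return the path (the boto3 call is omitted in this sketch).
            return "/tmp/dag_cache/%s/%s/%s" % (parsed.netloc, dag_id, version)


    FETCHERS = {"s3": S3DagFetcher()}                  # registry keyed by scheme


    def fetch_dag(uri):
        return FETCHERS[urlparse(uri).scheme].fetch(uri)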
On Mon, Dec 16, 2019 at 7:17 AM Dan Davydov <[email protected]> wrote:

The zip support is a bit of a hack and was a bit controversial when it was added. I think if we go down the path of supporting more DAG sources, we should make sure we have the right interface in place, so we avoid the current `if format == zip then: else:` and make sure that we don't tightly couple to specific DAG-sourcing implementations. Personally I feel that Docker makes more sense than wheels (since images are fully self-contained, even at the binary dependency level), but if we go down the interface route it might be fine to add support for both Docker and wheels.

On Mon, Dec 16, 2019 at 11:19 AM Björn Pollex <[email protected]> wrote:

Hi Jarek,

This sounds great. Is this possibly related to the work started in https://github.com/apache/airflow/pull/730?

I'm not sure I'm following your proposal entirely. Initially, a great first step would be to support loading DAGs from an entry_point, as proposed in the closed PR above. This would already enable most of the features you've mentioned below. Each DAG could be a Python package, and it would carry all the information about required packages in its package meta-data.

Is that what you're envisioning? If so, I'd be happy to support you with the implementation!

Also, while I think the idea of creating a temporary virtual environment for running tasks is very useful, I'd like this to be optional, as it can also create a lot of overhead for running tasks.

Cheers,
Björn
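For illustration, a DAG distributed as a Python package that advertises its DAG object through an entry point might have a setup.py roughly like the one below. The entry-point group name "airflow.dags" and the package layout are invented for this sketch - the closed PR above may use different conventions - but the mechanism (package metadata carrying both the dependencies and a pointer to the DAG) is the same.

    # Hypothetical setup.py for a DAG shipped as a package. The entry-point
    # group name "airflow.dags" is invented for this sketch.
    from setuptools import setup, find_packages

    setup(
        name="my-dag-package",
        version="0.1.0",
        packages=find_packages(),
        install_requires=["requests>=2.22"],       # the DAG's own dependencies
        entry_points={
            "airflow.dags": [
                # module path : attribute holding the DAG object
                "my_dag = my_dag_package.dag:dag",
            ],
        },
    )

A scheduler or DagBag could then discover such DAGs with something like pkg_resources.iter_entry_points("airflow.dags") and load each object via ep.load(), instead of scanning a DAGs folder.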
On 14 Dec 2019, at 11:10, Jarek Potiuk <[email protected]> wrote:

I had a lot of interesting discussions over the last few days with Apache Airflow users at PyDataWarsaw 2019 (I was actually quite surprised how many people use Airflow in Poland). One discussion brought up an interesting subject: packaging DAGs in wheel format. The users mentioned that they are super-happy using .zip-packaged DAGs, but they think it could be improved with the wheel format (which is also .zip, BTW). Maybe it was already mentioned in some discussions before, but I have not found any.

*Context:*

We are well on the way to implementing "AIP-21 Changing import paths" and will provide backport packages for Airflow 1.10. As a next step we want to target AIP-8. One of the problems with implementing AIP-8 (splitting hooks/operators into separate packages) is the problem of dependencies. Different operators/hooks might have different dependencies if maintained separately. Currently we have a common set of dependencies as we have only one setup.py, but if we split into separate packages, this might change.

*Proposal:*

Our users - who love the .zip DAG distribution - proposed that we package the DAGs and all related packages in a wheel package instead of a pure .zip. This would allow the users to install extra dependencies needed by the DAG. And it struck me that we could indeed do that for DAGs, but also mitigate most of the dependency problems for separately-packaged operators.

The proposal from our users was to package the extra dependencies together with the DAG in a wheel file. This is quite cool on its own, but I thought we might actually use the same approach to solve the dependency problem with AIP-8.

I think we could implement "operator group" -> extra -> "pip packages" dependencies (we need them anyway for AIP-21) and then we could have wheel packages with all the "extra" dependencies for each group of operators.
A worker executing an operator could have the "core" dependencies installed initially, but then, when it is supposed to run an operator, it could create a virtualenv, install the required "extra" from wheels and run the task for this operator in that virtualenv (and then remove the virtualenv). We could have such package wheels prepared (one wheel package per operator group) and distributed either the same way as DAGs or via some shared binary repository (and cached in the worker).

Having such a dynamically created virtualenv also has the advantage that if someone has a DAG with specific dependencies, those could be embedded in the DAG wheel, installed from it into this virtualenv, and the virtualenv would be removed after the task is finished.

The advantage of this approach is that each DAG's extra dependencies are isolated, and you could even have different versions of the same dependency used by different DAGs. I think that could save a lot of headaches for many users.

For me that whole idea sounds pretty cool.

Let me know what you think.

J.

--
Jarek Potiuk
Polidea <https://www.polidea.com/> | Principal Software Engineer
M: +48 660 796 129

--
Chao-Han Tsai