The zip support is a bit of a hack and was somewhat controversial when it was
added. If we go down the path of supporting more DAG sources, we should make
sure we have the right interface in place, so that we avoid the current
`if format == zip then: else:` branching and don't tightly couple to specific
DAG-sourcing implementations. Personally I feel that Docker makes more sense
than wheels (since Docker images are fully self-contained even at the
binary-dependency level), but if we go down the interface route it might be
fine to add support for both Docker and wheels.
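A minimal sketch of what such an interface could look like. None of these class or method names exist in Airflow; they are hypothetical, and only illustrate how the `if format == zip then: else:` check could be replaced by dispatch over pluggable DAG sources:

```python
# Hypothetical sketch of a pluggable DAG-source interface (not an Airflow
# API). Each packaging format (folder, zip, wheel, Docker image) would get
# its own implementation instead of format checks at the call site.
from abc import ABC, abstractmethod


class DagSource(ABC):
    """One implementation per DAG packaging format."""

    @abstractmethod
    def can_handle(self, path: str) -> bool:
        """Return True if this source recognises the given path."""

    @abstractmethod
    def load(self, path: str) -> list:
        """Load and return the DAG objects found at the path."""


class ZipDagSource(DagSource):
    def can_handle(self, path: str) -> bool:
        return path.endswith(".zip")

    def load(self, path: str) -> list:
        return []  # a real implementation would import DAGs from the archive


class WheelDagSource(DagSource):
    def can_handle(self, path: str) -> bool:
        return path.endswith(".whl")

    def load(self, path: str) -> list:
        return []  # a real implementation would install and import the wheel


def load_dags(path: str, sources: list) -> list:
    # Dispatch to the first source that handles the path; adding Docker or
    # wheel support then means adding a class, not another if/else branch.
    for source in sources:
        if source.can_handle(path):
            return source.load(path)
    raise ValueError(f"No DAG source can handle {path}")
```

Adding a new format (say, Docker) would then be a new `DagSource` subclass registered in the list, leaving the loader itself untouched.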

On Mon, Dec 16, 2019 at 11:19 AM Björn Pollex
<[email protected]> wrote:

> Hi Jarek,
>
> This sounds great. Is this possibly related to the work started in
> https://github.com/apache/airflow/pull/730?
>
> I'm not sure I'm following your proposal entirely. A great first step
> would be to support loading DAGs from an entry_point, as proposed in the
> closed PR above. This would already enable most of the features you've
> mentioned below. Each DAG could be a Python package, and it would carry
> all the information about required packages in its package metadata.
>
> Is that what you’re envisioning? If so, I’d be happy to support you with
> the implementation!
>
> Also, while I think the idea of creating a temporary virtual environment
> for running tasks is very useful, I'd like this to be optional, as it can
> also add a lot of overhead to running tasks.
>
> Cheers,
>
>         Björn
>
> > On 14. Dec 2019, at 11:10, Jarek Potiuk <[email protected]>
> wrote:
> >
> > I had a lot of interesting discussions over the last few days with
> > Apache Airflow users at PyData Warsaw 2019 (I was actually quite
> > surprised how many people use Airflow in Poland). One discussion brought
> > up an interesting subject: packaging DAGs in the wheel format. The users
> > mentioned that they are super-happy using .zip-packaged DAGs, but they
> > think it could be improved with the wheel format (which is also .zip,
> > BTW). Maybe it was already mentioned in some discussions before, but I
> > have not found any.
> >
> > *Context:*
> >
> > We are well on the way to implementing "AIP-21 Changing import paths"
> > and will provide backport packages for Airflow 1.10. As a next step we
> > want to target AIP-8.
> > One of the problems in implementing AIP-8 (splitting hooks/operators
> > into separate packages) is dependencies. Different operators/hooks might
> > have different dependencies if maintained separately. Currently we have
> > a common set of dependencies because we have only one setup.py, but if
> > we split into separate packages, this might change.
> >
> > *Proposal:*
> >
> > Our users - who love the .zip DAG distribution - proposed that we
> > package the DAGs and all related packages in a wheel package instead of
> > a pure .zip. This would allow the users to install extra dependencies
> > needed by the DAG. And it struck me that we could indeed do that for
> > DAGs, but also mitigate most of the dependency problems for
> > separately-packaged operators.
> >
> > The proposal from our users was to package the extra dependencies
> > together with the DAG in a wheel file. This is quite cool on its own,
> > but I thought we might actually use the same approach to solve the
> > dependency problem with AIP-8.
> >
> > I think we could implement "operator group" -> extra -> "pip packages"
> > dependencies (we need them anyway for AIP-21), and then we could have
> > wheel packages with all the "extra" dependencies for each group of
> > operators.
> >
> > A worker executing an operator could have the "core" dependencies
> > installed initially, but when it is supposed to run an operator it could
> > create a virtualenv, install the required "extras" from wheels, and run
> > the task for this operator in that virtualenv (and then remove the
> > virtualenv). We could have such package wheels prepared (one wheel
> > package per operator group) and distributed either the same way as DAGs
> > or via some shared binary repository (and cached on the worker).
> >
> > Such a dynamically created virtualenv also has the advantage that if
> > someone has a DAG with specific dependencies, they could be embedded in
> > the DAG wheel, installed from it into the virtualenv, and the virtualenv
> > would be removed after the task finishes.
> >
> > The advantage of this approach is that each DAG's extra dependencies
> > are isolated, and you could even have different versions of the same
> > dependency used by different DAGs. I think that could save a lot of
> > headaches for many users.
> >
> > For me that whole idea sounds pretty cool.
> >
> > Let me know what you think.
> >
> > J.
> >
> >
> > --
> >
> > Jarek Potiuk
> > Polidea <https://www.polidea.com/> | Principal Software Engineer
> >
> > M: +48 660 796 129
>
>

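The ephemeral per-task virtualenv flow described in the quoted proposal can be sketched with the Python standard library alone. The function name and arguments below are illustrative, not an Airflow API, and the interpreter path assumes a POSIX layout:

```python
# Hypothetical sketch of the per-task throwaway virtualenv: create a fresh
# venv, install the operator-group wheel (plus any DAG wheel) into it, run
# the task with that interpreter, and delete everything afterwards.
import subprocess
import tempfile
import venv
from pathlib import Path


def run_task_in_venv(wheels: list, task_cmd: list) -> int:
    """Run `python <task_cmd...>` inside a temporary virtualenv.

    wheels   -- paths to .whl files carrying the "extra" dependencies
    task_cmd -- arguments passed to the venv's Python interpreter
    """
    with tempfile.TemporaryDirectory() as tmp:
        env_dir = Path(tmp) / "task-env"
        venv.create(env_dir, with_pip=True)  # fresh, isolated environment
        # POSIX layout; on Windows the interpreter lives under Scripts/.
        python = env_dir / "bin" / "python"
        if wheels:
            # Install the "extra" dependencies shipped as wheel files.
            subprocess.run(
                [str(python), "-m", "pip", "install", *wheels], check=True
            )
        # Run the task with the venv's interpreter; the venv (and all the
        # extra dependencies) is removed when the context manager exits.
        result = subprocess.run([str(python), *task_cmd])
        return result.returncode
```

This keeps each task's dependencies isolated, at the cost of the venv-creation and pip-install overhead that Björn notes should be optional.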