I've been running Airflow on 1500 cores in the context of scientific
workflows for the past year and a half. Features that would be important to
me for 2.0:

- Add an FK to dag_run on the task_instance table in Postgres so that
task instances can be uniquely attributed to DAG runs.
- Ensure the scheduler can be run continuously without needing restarts.
Right now it gets into some poorly understood bad state, forcing me to
restart it every 20 minutes.
- Ensure the scheduler can handle tens of thousands of active workflows.
Right now scheduling times become extremely long and scheduling becomes
inconsistent even at 2 thousand active workflows.
- Add more flexible task scheduling prioritization. The default
prioritization is the opposite of the behaviour I want. I would prefer that
downstream tasks always have higher priority than upstream tasks, so that
entire workflows tend to complete sooner instead of the scheduler picking
up tasks from other workflows. Having a few scheduling prioritization
strategies would be beneficial here.
- Provide better support for manually triggered DAGs in the UI, i.e. by
showing them as queued.
- Provide some resource management capabilities via something like slots
that can be defined on workers and occupied by tasks. Using Celery's
concurrency parameter at the Airflow server level is too coarse-grained: it
forces all workers to be the same and does not allow proper resource
management when different workflow tasks have different resource
requirements, which hurts utilization (for instance, a worker could run 8
parallel tasks with a small memory footprint, but only 1 task with a large
memory footprint).
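
To make the prioritization point above concrete, here is a minimal sketch of
a "downstream-first" strategy in plain Python. It is not Airflow code: the
function names, the task names and the weighting rule (1 + number of
transitive upstream tasks) are all assumptions made for illustration, and
the dependency graph is assumed to be acyclic.

```python
from collections import defaultdict


def downstream_first_weights(tasks, edges):
    """Return {task: weight}, where weight = 1 + number of transitive upstream tasks.

    Hypothetical helper, not an Airflow API. `edges` holds
    (upstream, downstream) pairs and is assumed to form a DAG.
    """
    parents = defaultdict(set)
    for upstream, downstream in edges:
        parents[downstream].add(upstream)

    cache = {}

    def ancestors(task):
        # Memoized set of every task that must finish before `task`.
        if task not in cache:
            found = set()
            for parent in parents[task]:
                found.add(parent)
                found |= ancestors(parent)
            cache[task] = found
        return cache[task]

    return {task: 1 + len(ancestors(task)) for task in tasks}


def scheduling_order(ready, weights):
    """Order ready tasks so that the highest weight comes first.

    Ties are broken by task name to keep the order deterministic.
    """
    return sorted(ready, key=lambda task: (-weights[task], task))


# Workflow "a" is a three-step chain; workflow "b" has a single task.
tasks = ["a.extract", "a.transform", "a.load", "b.extract"]
edges = [("a.extract", "a.transform"), ("a.transform", "a.load")]

weights = downstream_first_weights(tasks, edges)
order = scheduling_order(["b.extract", "a.load"], weights)
print(order)  # ['a.load', 'b.extract']
```

With this rule the last step of workflow "a" is picked before the first
step of workflow "b", so a workflow that has already started tends to
finish before a new one begins.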

With best regards,

Sergei.


On Mon, Nov 21, 2016 at 2:00 PM Ryabchuk, Pavlo <ext-pavlo.ryabc...@here.com>
wrote:

> -1. We rely heavily on data profiling as a pipeline health monitoring
> tool.
>
> -----Original Message-----
> From: Chris Riccomini [mailto:criccom...@apache.org]
> Sent: Saturday, November 19, 2016 1:57 AM
> To: dev@airflow.incubator.apache.org
> Subject: Re: Airflow 2.0
>
> > RIP out the charting application and the data profiler
>
> Yes please! +1
>
> On Fri, Nov 18, 2016 at 2:41 PM, Maxime Beauchemin <
> maximebeauche...@gmail.com> wrote:
> > Another point that may be controversial for Airflow 2.0: RIP out the
> > charting application and the data profiler. Even though it's nice to
> > have it there, it's just out of scope and has major security
> > issues/implications.
> >
> > I'm not sure how popular it actually is. We may need to run a survey
> > at some point around these kinds of questions.
> >
> > Max
> >
> > On Fri, Nov 18, 2016 at 2:39 PM, Maxime Beauchemin <
> > maximebeauche...@gmail.com> wrote:
> >
> >> Using FAB's Model, we get pretty much all of that (REST API,
> >> auth/perms, CRUD) for free:
> >> http://flask-appbuilder.readthedocs.io/en/latest/quickhowto.html?highlight=rest#exposed-methods
> >>
> >> I'm pretty intimate with FAB since I use it (and contributed to it)
> >> for Superset/Caravel.
> >>
> >> All that's needed is to derive from FAB's model class instead of
> >> SQLAlchemy's model class (which FAB's model wraps and adds
> >> functionality to, and is 100% compatible with AFAICT).
> >>
> >> Max
> >>
> >> On Fri, Nov 18, 2016 at 2:07 PM, Chris Riccomini
> >> <criccom...@apache.org>
> >> wrote:
> >>
> >>> > It may be doable to run this as a different package
> >>> > `airflow-webserver`, an alternate UI at first, and to eventually
> >>> > rip out the old UI off of the main package.
> >>>
> >>> This is the same strategy that I was thinking of for AIRFLOW-85. You
> >>> can build the new UI in parallel, and then delete the old one later.
> >>> I really think that a REST interface should be a pre-req to any
> >>> large/new UI changes, though. Getting unified so that everything is
> >>> driven through REST will be a big win.
> >>>
> >>> On Fri, Nov 18, 2016 at 1:51 PM, Maxime Beauchemin
> >>> <maximebeauche...@gmail.com> wrote:
> >>> > A multi-tenant UI with composable roles on top of granular
> >>> > permissions.
> >>> >
> >>> > Migrating from Flask-Admin to Flask App Builder would be an
> >>> > easy-ish win (since they're both Flask). FAB provides a good
> >>> > authentication and permission model that ships out-of-the-box with
> >>> > a REST API. It suffices to define FAB models (derivatives of
> >>> > SQLAlchemy's model) and you get a set of perms for the model
> >>> > (can_show, can_list, can_add, can_change, can_delete, ...) and a
> >>> > set of CRUD REST endpoints. It would also allow us to rip the
> >>> > authentication backend code out of Airflow and rely on FAB for that.
> >>> > Also, every single view gets permissions auto-created for it, and
> >>> > there is an easy way to define row-level type filters based on
> >>> > user permissions.
> >>> >
> >>> > It may be doable to run this as a different package
> >>> > `airflow-webserver`, an alternate UI at first, and to eventually
> >>> > rip out the old UI off of the main package.
> >>> >
> >>> > https://flask-appbuilder.readthedocs.io/en/latest/
> >>> >
> >>> > I'd love to carve out some time and lead this.
> >>> >
> >>> > On Fri, Nov 18, 2016 at 1:32 PM, Chris Riccomini
> >>> > <criccom...@apache.org
> >>> >
> >>> > wrote:
> >>> >
> >>> >> Full-fledged REST API (that the UI also uses) would be great in 2.0.
> >>> >>
> >>> >> On Fri, Nov 18, 2016 at 6:26 AM, David Kegley <k...@b23.io> wrote:
> >>> >> > Hi All,
> >>> >> >
> >>> >> > We have been using Airflow heavily for the last couple months
> >>> >> > and it’s been great so far. Here are a few things we’d like to
> >>> >> > see prioritized in 2.0.
> >>> >> >
> >>> >> > 1) Role based access to DAGs:
> >>> >> > We would like to see better role based access through the UI.
> >>> >> > There’s a related ticket out there but it hasn’t seen any action
> >>> >> > in a few months:
> >>> >> > https://issues.apache.org/jira/browse/AIRFLOW-85
> >>> >> >
> >>> >> > We use a templating system to create/deploy DAGs dynamically
> >>> >> > based on some directory/file structure. This allows analysts to
> >>> >> > quickly deploy and schedule their ETL code without having to
> >>> >> > interact with the Airflow installation directly. It would be
> >>> >> > great if those same analysts could access their own DAGs in the
> >>> >> > UI so that they can clear DAG runs, mark success, etc. while
> >>> >> > keeping them away from our core ETL and other
> >>> >> > people's/organization's DAGs. Some of this can be accomplished
> >>> >> > with ‘filter by owner’ but it doesn’t address the use case where
> >>> >> > a DAG can be maintained by multiple users in the same
> >>> >> > organization when they have separate Airflow user accounts.
> >>> >> >
> >>> >> > 2) An option to turn off backfill:
> >>> >> > https://issues.apache.org/jira/browse/AIRFLOW-558
> >>> >> > For cases where a DAG does an insert overwrite on a table every
> >>> >> > day. This might be a realistic option for the current version
> >>> >> > but I just wanted to call attention to this feature request.
> >>> >> >
> >>> >> > Best,
> >>> >> > David
> >>> >> >
> >>> >> > On Nov 17, 2016, at 6:19 PM, Maxime Beauchemin <
> >>> >> maximebeauche...@gmail.com<mailto:maximebeauche...@gmail.com>>
> wrote:
> >>> >> >
> >>> >> > *This is a brainstorm email thread about Airflow 2.0!*
> >>> >> >
> >>> >> > I wanted to share some ideas around what I would like to do in
> >>> >> > Airflow 2.0 and would love to hear what others are thinking. I'll
> >>> >> > compile the ideas that are shared in this thread in a Wiki once
> >>> >> > the conversation fades.
> >>> >> >
> >>> >> > -------------------------------------------
> >>> >> >
> >>> >> > First idea, to get the conversation started:
> >>> >> >
> >>> >> > *Breaking down the package*
> >>> >> > `pip install airflow-common airflow-scheduler airflow-webserver
> >>> >> > airflow-operators-googlecloud ...`
> >>> >> >
> >>> >> > It seems to me like we're getting to a point where having
> >>> >> > different repositories and different packages would make things
> >>> >> > much easier in all sorts of ways. For instance the web server is
> >>> >> > a lot less sensitive than the scheduler, and changes to operators
> >>> >> > should/could be deployed at will, independently from the main
> >>> >> > package. People in their environment could upgrade only certain
> >>> >> > packages when needed. Travis builds would be more targeted, and
> >>> >> > take less time, ...
> >>> >> >
> >>> >> > Also, the whole current "extras_require" approach to optional
> >>> >> > dependencies (in setup.py) is kind of getting out of hand.
> >>> >> >
> >>> >> > Of course `pip install airflow` would bring in a collection of
> >>> >> > sub-packages similar in functionality to what it does now,
> >>> >> > perhaps without so many operators you probably don't need in
> >>> >> > your environment.
> >>> >> >
> >>> >> > The release process is the main pain-point and the biggest risk
> >>> >> > for the project, and I feel like this is a solid solution to
> >>> >> > address it.
> >>> >> >
> >>> >> > Max
> >>> >> >
> >>> >>
> >>>
> >>
> >>
>
-- 

Sergei