Also, a survey will be a little less noisy and easier to summarize than +1s
in this email thread.
-s (Sid)

On Mon, Nov 21, 2016 at 2:25 PM, siddharth anand <[email protected]> wrote:

> Sergei,
> These are some great ideas -- I would classify at least half of them as
> pain points.
>
> Folks!
> I suggest people (on the dev list) keep feeding this thread at least for
> the next 2 days. I can then float a survey based on these ideas and give
> the community a chance to vote so we can prioritize the wish list.
>
> -s
>
> On Mon, Nov 21, 2016 at 5:22 AM, Sergei Iakhnin <[email protected]> wrote:
>
>> I've been running Airflow on 1500 cores in the context of scientific
>> workflows for the past year and a half. Features that would be important
>> to me for 2.0:
>>
>> - Add an FK to dag_run on the task_instance table on Postgres so that
>> task_instances can be uniquely attributed to DAG runs.
>> - Ensure the scheduler can run continuously without needing restarts.
>> Right now it gets into some ill-determined bad state, forcing me to
>> restart it every 20 minutes.
>> - Ensure the scheduler can handle tens of thousands of active workflows.
>> Right now this results in extremely long scheduling times and
>> inconsistent scheduling even at 2,000 active workflows.
>> - Add more flexible task scheduling prioritization. The default
>> prioritization is the opposite of the behaviour I want: I would prefer
>> that downstream tasks always have higher priority than upstream tasks,
>> so that entire workflows tend to complete sooner rather than the
>> scheduler picking up tasks from other workflows. Having a few scheduling
>> prioritization strategies to choose from would be beneficial here (a
>> rough sketch follows after this list).
>> - Provide better support for manually-triggered DAGs in the UI, e.g. by
>> showing them as queued.
>> - Provide some resource management capabilities via something like slots
>> that can be defined on workers and occupied by tasks. Using Celery's
>> concurrency parameter at the Airflow server level is too coarse-grained:
>> it forces all workers to be the same and does not allow proper resource
>> management when different workflow tasks have different resource
>> requirements, thus hurting utilization (a worker could run 8 parallel
>> tasks with a small memory footprint, but only 1 task with a large memory
>> footprint, for instance).
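>>
>> To make the prioritization point concrete, here is a rough sketch of one
>> such strategy (not an existing Airflow option; the helper name is mine).
>> It relies only on each task's priority_weight attribute and upstream_list
>> property, and would be called at the bottom of a DAG file:
>>
>>     def apply_depth_based_priority(dag):
>>         """Weight each task by its ancestor count, so downstream tasks
>>         outrank upstream ones when the executor picks what to run."""
>>         for task in dag.tasks:
>>             seen, frontier = set(), list(task.upstream_list)
>>             while frontier:
>>                 t = frontier.pop()
>>                 if t.task_id not in seen:
>>                     seen.add(t.task_id)
>>                     frontier.extend(t.upstream_list)
>>             # deeper tasks get larger weights and are scheduled first
>>             task.priority_weight = len(seen) + 1
>>
>>     # usage, after all tasks in the DAG file are defined:
>>     # apply_depth_based_priority(dag)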
>>
>> With best regards,
>>
>> Sergei.
>>
>>
>> On Mon, Nov 21, 2016 at 2:00 PM Ryabchuk, Pavlo <
>> [email protected]>
>> wrote:
>>
>> > -1. We rely heavily on data profiling as a pipeline health monitoring
>> > tool.
>> >
>> > -----Original Message-----
>> > From: Chris Riccomini [mailto:[email protected]]
>> > Sent: Saturday, November 19, 2016 1:57 AM
>> > To: [email protected]
>> > Subject: Re: Airflow 2.0
>> >
>> > > RIP out the charting application and the data profiler
>> >
>> > Yes please! +1
>> >
>> > On Fri, Nov 18, 2016 at 2:41 PM, Maxime Beauchemin <
>> > [email protected]> wrote:
>> > > Another point that may be controversial for Airflow 2.0: RIP out the
>> > > charting application and the data profiler. Even though it's nice to
>> > > have it there, it's just out of scope and has major security
>> > issues/implications.
>> > >
>> > > I'm not sure how popular it actually is. We may need to run a survey
>> > > at some point around this kind of question.
>> > >
>> > > Max
>> > >
>> > > On Fri, Nov 18, 2016 at 2:39 PM, Maxime Beauchemin <
>> > > [email protected]> wrote:
>> > >
>> > >> Using FAB's Model, we get pretty much all of that (REST API,
>> > >> auth/perms,
>> > >> CRUD) for free:
>> > >> http://flask-appbuilder.readthedocs.io/en/latest/quickhowto.html?highlight=rest#exposed-methods
>> > >>
>> > >> I'm pretty intimate with FAB since I use it (and contributed to it)
>> > >> for Superset/Caravel.
>> > >>
>> > >> All that's needed is to derive from FAB's model class instead of
>> > >> SQLAlchemy's model class (FAB's model wraps SQLAlchemy's, adds
>> > >> functionality to it, and is 100% compatible AFAICT).
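>> > >>
>> > >> As a rough illustration (a sketch based on FAB's documented pattern,
>> > >> not Airflow code -- the Connection model below is made up), deriving
>> > >> from FAB's Model and registering a ModelView is all it takes:
>> > >>
>> > >>     from flask_appbuilder import Model, ModelView
>> > >>     from flask_appbuilder.models.sqla.interface import SQLAInterface
>> > >>     from sqlalchemy import Column, Integer, String
>> > >>
>> > >>     class Connection(Model):          # illustrative model only
>> > >>         id = Column(Integer, primary_key=True)
>> > >>         conn_id = Column(String(250), unique=True)
>> > >>
>> > >>     class ConnectionView(ModelView):
>> > >>         datamodel = SQLAInterface(Connection)
>> > >>
>> > >>     # registering the view auto-creates its can_* permissions plus
>> > >>     # CRUD REST endpoints:
>> > >>     # appbuilder.add_view(ConnectionView, "Connections")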
>> > >>
>> > >> Max
>> > >>
>> > >> On Fri, Nov 18, 2016 at 2:07 PM, Chris Riccomini
>> > >> <[email protected]>
>> > >> wrote:
>> > >>
>> > >>> > It may be doable to run this as a different package
>> > >>> > `airflow-webserver`, an alternate UI at first, and to eventually
>> > >>> > rip the old UI out of the main package.
>> > >>>
>> > >>> This is the same strategy that I was thinking of for AIRFLOW-85. You
>> > >>> can build the new UI in parallel, and then delete the old one later.
>> > >>> I really think that a REST interface should be a pre-req to any
>> > >>> large/new UI changes, though. Getting unified so that everything is
>> > >>> driven through REST will be a big win.
>> > >>>
>> > >>> On Fri, Nov 18, 2016 at 1:51 PM, Maxime Beauchemin
>> > >>> <[email protected]> wrote:
>> > >>> > A multi-tenant UI with composable roles on top of granular
>> > >>> > permissions.
>> > >>> >
>> > >>> > Migrating from Flask-Admin to Flask App Builder would be an
>> > >>> > easy-ish win (since they're both Flask). FAB provides a good
>> > >>> > authentication and permission model that ships out-of-the-box
>> > >>> > with a REST API. It suffices to define FAB models (derivatives of
>> > >>> > SQLAlchemy's model) and you get a set of perms for the model
>> > >>> > (can_show, can_list, can_add, can_change, can_delete, ...) and a
>> > >>> > set of CRUD REST endpoints. It would also allow us to rip the
>> > >>> > authentication backend code out of Airflow and rely on FAB for
>> > >>> > that. Also, every single view gets permissions auto-created for
>> > >>> > it, and there are easy ways to define row-level filters based on
>> > >>> > user permissions.
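>> > >>> >
>> > >>> > For the row-level part, a sketch of what FAB's base_filters make
>> > >>> > possible (the model and column names here are placeholders, not a
>> > >>> > worked-out Airflow integration):
>> > >>> >
>> > >>> >     from flask_appbuilder import Model, ModelView
>> > >>> >     from flask_appbuilder.models.sqla.interface import SQLAInterface
>> > >>> >     from flask_appbuilder.models.sqla.filters import FilterEqualFunction
>> > >>> >     from flask_login import current_user
>> > >>> >     from sqlalchemy import Column, Integer, String
>> > >>> >
>> > >>> >     def current_username():
>> > >>> >         return current_user.username
>> > >>> >
>> > >>> >     class OwnedDag(Model):                 # placeholder model
>> > >>> >         id = Column(Integer, primary_key=True)
>> > >>> >         dag_id = Column(String(250), unique=True)
>> > >>> >         owner = Column(String(250))
>> > >>> >
>> > >>> >     class OwnedDagView(ModelView):
>> > >>> >         datamodel = SQLAInterface(OwnedDag)
>> > >>> >         # only list rows whose owner matches the logged-in user
>> > >>> >         base_filters = [["owner", FilterEqualFunction, current_username]]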
>> > >>> >
>> > >>> > It may be doable to run this as a different package
>> > >>> > `airflow-webserver`, an alternate UI at first, and to eventually
>> > >>> > rip the old UI out of the main package.
>> > >>> >
>> > >>> > https://flask-appbuilder.readthedocs.io/en/latest/
>> > >>> >
>> > >>> > I'd love to carve out some time and lead this.
>> > >>> >
>> > >>> > On Fri, Nov 18, 2016 at 1:32 PM, Chris Riccomini
>> > >>> > <[email protected]> wrote:
>> > >>> >
>> > >>> >> A full-fledged REST API (that the UI also uses) would be great
>> > >>> >> in 2.0.
>> > >>> >>
>> > >>> >> On Fri, Nov 18, 2016 at 6:26 AM, David Kegley <[email protected]>
>> wrote:
>> > >>> >> > Hi All,
>> > >>> >> >
>> > >>> >> > We have been using Airflow heavily for the last couple of
>> > >>> >> > months and it’s been great so far. Here are a few things we’d
>> > >>> >> > like to see prioritized in 2.0.
>> > >>> >> >
>> > >>> >> > 1) Role-based access to DAGs:
>> > >>> >> > We would like to see better role-based access through the UI.
>> > >>> >> > There’s a related ticket out there, but it hasn’t seen any
>> > >>> >> > action in a few months:
>> > >>> >> > https://issues.apache.org/jira/browse/AIRFLOW-85
>> > >>> >> >
>> > >>> >> > We use a templating system to create/deploy DAGs dynamically
>> > >>> >> > based on some directory/file structure. This allows analysts
>> > >>> >> > to quickly deploy and schedule their ETL code without having
>> > >>> >> > to interact with the Airflow installation directly. It would
>> > >>> >> > be great if those same analysts could access their own DAGs in
>> > >>> >> > the UI so that they can clear DAG runs, mark success, etc.
>> > >>> >> > while keeping them away from our core ETL and other
>> > >>> >> > people's/organization's DAGs. Some of this can be accomplished
>> > >>> >> > with ‘filter by owner’, but it doesn’t address the use case
>> > >>> >> > where a DAG can be maintained by multiple users in the same
>> > >>> >> > organization when they have separate Airflow user accounts.
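>> > >>> >> >
>> > >>> >> > (For context, a minimal, hypothetical sketch of this kind of
>> > >>> >> > template-driven generation -- not our actual system; the file
>> > >>> >> > layout and field names are made up -- one DAG per YAML spec
>> > >>> >> > dropped into a directory:)
>> > >>> >> >
>> > >>> >> >     import os
>> > >>> >> >     from datetime import datetime
>> > >>> >> >     import yaml
>> > >>> >> >     from airflow import DAG
>> > >>> >> >     from airflow.operators.bash_operator import BashOperator
>> > >>> >> >
>> > >>> >> >     TEMPLATE_DIR = "/etc/airflow/dag_templates"  # assumed path
>> > >>> >> >
>> > >>> >> >     for fname in os.listdir(TEMPLATE_DIR):
>> > >>> >> >         if not fname.endswith(".yaml"):
>> > >>> >> >             continue
>> > >>> >> >         with open(os.path.join(TEMPLATE_DIR, fname)) as f:
>> > >>> >> >             spec = yaml.safe_load(f)
>> > >>> >> >         dag = DAG(
>> > >>> >> >             dag_id=spec["dag_id"],
>> > >>> >> >             schedule_interval=spec.get("schedule", "@daily"),
>> > >>> >> >             start_date=datetime(2016, 1, 1),
>> > >>> >> >             default_args={"owner": spec.get("owner", "airflow")},
>> > >>> >> >         )
>> > >>> >> >         BashOperator(task_id="run_etl",
>> > >>> >> >                      bash_command=spec["command"], dag=dag)
>> > >>> >> >         # expose at module level so the DagBag picks it up
>> > >>> >> >         globals()[spec["dag_id"]] = dag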
>> > >>> >> >
>> > >>> >> > 2) An option to turn off backfill:
>> > >>> >> > https://issues.apache.org/jira/browse/AIRFLOW-558
>> > >>> >> > For cases where a DAG does an insert overwrite on a table every
>> > >>> >> > day. This might be a realistic option for the current version
>> > >>> >> > but I just wanted to call attention to this feature request.
>> > >>> >> >
>> > >>> >> > Best,
>> > >>> >> > David
>> > >>> >> >
>> > >>> >> > On Nov 17, 2016, at 6:19 PM, Maxime Beauchemin
>> > >>> >> > <[email protected]> wrote:
>> > >>> >> >
>> > >>> >> > *This is a brainstorm email thread about Airflow 2.0!*
>> > >>> >> >
>> > >>> >> > I wanted to share some ideas around what I would like to do in
>> > >>> >> > Airflow 2.0 and would love to hear what others are thinking.
>> > >>> >> > I'll compile the ideas that are shared in this thread in a Wiki
>> > >>> >> > once the conversation fades.
>> > >>> >> >
>> > >>> >> > -------------------------------------------
>> > >>> >> >
>> > >>> >> > First idea, to get the conversation started:
>> > >>> >> >
>> > >>> >> > *Breaking down the package*
>> > >>> >> > `pip install airflow-common airflow-scheduler airflow-webserver
>> > >>> >> > airflow-operators-googlecloud ...`
>> > >>> >> >
>> > >>> >> > It seems to me like we're getting to a point where having
>> > >>> >> > different repositories and different packages would make things
>> > >>> >> > much easier in all sorts of ways. For instance, the web server
>> > >>> >> > is a lot less sensitive than the scheduler, and changes to
>> > >>> >> > operators should/could be deployed at will, independently from
>> > >>> >> > the main package. People could upgrade only certain packages in
>> > >>> >> > their environment when needed. Travis builds would be more
>> > >>> >> > targeted and take less time, ...
>> > >>> >> >
>> > >>> >> > Also, the whole current "extras_require" approach to optional
>> > >>> >> > dependencies (in setup.py) is kind of getting out of hand.
>> > >>> >> >
>> > >>> >> > Of course `pip install airflow` would bring in a collection of
>> > >>> >> > sub-packages similar in functionality to what it does now,
>> > >>> >> > perhaps without the many operators you probably don't need in
>> > >>> >> > your environment.
>> > >>> >> >
>> > >>> >> > The release process is the main pain point and the biggest
>> > >>> >> > risk for the project, and I feel like this is a solid solution
>> > >>> >> > to address it.
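>> > >>> >> >
>> > >>> >> > As a rough sketch of what the split could look like (package
>> > >>> >> > names are just the illustrative ones from above, nothing
>> > >>> >> > settled), each sub-package would carry its own setup.py and
>> > >>> >> > pin the shared core:
>> > >>> >> >
>> > >>> >> >     # airflow-operators-googlecloud/setup.py (hypothetical)
>> > >>> >> >     from setuptools import setup, find_packages
>> > >>> >> >
>> > >>> >> >     setup(
>> > >>> >> >         name="airflow-operators-googlecloud",
>> > >>> >> >         version="2.0.0",
>> > >>> >> >         packages=find_packages(),
>> > >>> >> >         install_requires=[
>> > >>> >> >             "airflow-common>=2.0.0",  # shared core, assumed name
>> > >>> >> >             "google-api-python-client",  # provider-specific dep
>> > >>> >> >         ],
>> > >>> >> >     )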
>> > >>> >> >
>> > >>> >> > Max
>> > >>> >> >
>> > >>> >>
>> > >>>
>> > >>
>> > >>
>> >
>> --
>>
>> Sergei
>>
>
>
