Also, a survey will be a little less noisy and easier to summarize than +1s in this email thread. -s (Sid)
On Mon, Nov 21, 2016 at 2:25 PM, siddharth anand <[email protected]> wrote:

> Sergei,
> These are some great ideas -- I would classify at least half of them as pain points.
>
> Folks!
> I suggest people (on the dev list) keep feeding this thread at least for the next 2 days. I can then float a survey based on these ideas and give the community a chance to vote so we can prioritize the wish list.
>
> -s
>
> On Mon, Nov 21, 2016 at 5:22 AM, Sergei Iakhnin <[email protected]> wrote:
>
>> I've been running Airflow on 1500 cores in the context of scientific workflows for the past year and a half. Features that would be important to me for 2.0:
>>
>> - Add an FK to dag_run on the task_instance table in Postgres so that task_instances can be uniquely attributed to DAG runs.
>> - Ensure the scheduler can run continuously without needing restarts. Right now it gets into some ill-determined bad state, forcing me to restart it every 20 minutes.
>> - Ensure the scheduler can handle tens of thousands of active workflows. Right now this results in extremely long scheduling times and inconsistent scheduling even at 2,000 active workflows.
>> - Add more flexible task scheduling prioritization. The default prioritization is the opposite of the behaviour I want. I would prefer that downstream tasks always have higher priority than upstream tasks, so that entire workflows tend to complete sooner rather than scheduling tasks from other workflows. Having a few scheduling prioritization strategies would be beneficial here.
>> - Provide better support for manually-triggered DAGs in the UI, i.e. by showing them as queued.
>> - Provide some resource management capabilities via something like slots that can be defined on workers and occupied by tasks. Using Celery's concurrency parameter at the Airflow server level is too coarse-grained, as it forces all workers to be the same and does not allow proper resource management when different workflow tasks have different resource requirements, thus hurting utilization (a worker could run 8 parallel tasks with a small memory footprint, but only 1 task with a large memory footprint, for instance).
>>
>> With best regards,
>>
>> Sergei.
>>
>>
>> On Mon, Nov 21, 2016 at 2:00 PM Ryabchuk, Pavlo <[email protected]> wrote:
>>
>> > -1. We rely heavily on data profiling as a pipeline health monitoring tool.
>> >
>> > -----Original Message-----
>> > From: Chris Riccomini [mailto:[email protected]]
>> > Sent: Saturday, November 19, 2016 1:57 AM
>> > To: [email protected]
>> > Subject: Re: Airflow 2.0
>> >
>> > > RIP out the charting application and the data profiler
>> >
>> > Yes please! +1
>> >
>> > On Fri, Nov 18, 2016 at 2:41 PM, Maxime Beauchemin <[email protected]> wrote:
>> > > Another point that may be controversial for Airflow 2.0: RIP out the charting application and the data profiler. Even though it's nice to have it there, it's just out of scope and has major security issues/implications.
>> > >
>> > > I'm not sure how popular it actually is. We may need to run a survey at some point around this kind of question.
>> > >
>> > > Max
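Sergei's prioritization point above (downstream tasks outranking upstream ones so in-flight workflows drain first) could be approximated today at DAG-definition time. Below is a minimal sketch, not an existing Airflow setting: the helper name `prioritize_downstream` is made up for illustration, and it only assumes `BaseOperator`'s real `priority_weight` and `upstream_list` attributes plus `DAG.tasks`.

```python
# Minimal sketch (not an Airflow feature): set priority_weight by depth so that
# downstream tasks outrank upstream ones and running workflows tend to finish
# before tasks from other workflows are picked up.
def prioritize_downstream(dag):
    depth = {}

    def depth_of(task):
        # Depth of a task = 1 + max depth of its parents; roots have depth 0.
        if task.task_id not in depth:
            parents = task.upstream_list
            depth[task.task_id] = (
                1 + max(depth_of(p) for p in parents) if parents else 0
            )
        return depth[task.task_id]

    for task in dag.tasks:
        task.priority_weight = depth_of(task)
    return dag
```

Calling `prioritize_downstream(dag)` at the bottom of a DAG file would apply the strategy per workflow. The worker-slots idea is partially covered today by the `queue` argument on operators combined with `airflow worker -q <queue>`, which lets heavyweight tasks be routed to dedicated workers, though it still does not model per-worker memory slots.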
>> > >
>> > > On Fri, Nov 18, 2016 at 2:39 PM, Maxime Beauchemin <[email protected]> wrote:
>> > >
>> > >> Using FAB's Model, we get pretty much all of that (REST API, auth/perms, CRUD) for free:
>> > >> http://flask-appbuilder.readthedocs.io/en/latest/quickhowto.html?highlight=rest#exposed-methods
>> > >>
>> > >> I'm pretty intimate with FAB since I use it (and contributed to it) for Superset/Caravel.
>> > >>
>> > >> All that's needed is to derive FAB's model class instead of SQLAlchemy's model class (which FAB's model wraps and adds functionality to, and is 100% compatible with, AFAICT).
>> > >>
>> > >> Max
>> > >>
>> > >> On Fri, Nov 18, 2016 at 2:07 PM, Chris Riccomini <[email protected]> wrote:
>> > >>
>> > >>> > It may be doable to run this as a different package `airflow-webserver`, an alternate UI at first, and to eventually rip the old UI out of the main package.
>> > >>>
>> > >>> This is the same strategy that I was thinking of for AIRFLOW-85. You can build the new UI in parallel, and then delete the old one later. I really think that a REST interface should be a pre-req to any large/new UI changes, though. Getting unified so that everything is driven through REST will be a big win.
>> > >>>
>> > >>> On Fri, Nov 18, 2016 at 1:51 PM, Maxime Beauchemin <[email protected]> wrote:
>> > >>> > A multi-tenant UI with composable roles on top of granular permissions.
>> > >>> >
>> > >>> > Migrating from Flask-Admin to Flask App Builder would be an easy-ish win (since they're both Flask). FAB provides a good authentication and permission model that ships out of the box with a REST API. It suffices to define FAB models (derivatives of SQLAlchemy's model) and you get a set of perms for the model (can_show, can_list, can_add, can_change, can_delete, ...) and a set of CRUD REST endpoints. It would also allow us to rip the authentication backend code out of Airflow and rely on FAB for that. Also, every single view gets permissions auto-created for it, and there are easy ways to define row-level-type filters based on user permissions.
>> > >>> >
>> > >>> > It may be doable to run this as a different package `airflow-webserver`, an alternate UI at first, and to eventually rip the old UI out of the main package.
>> > >>> >
>> > >>> > https://flask-appbuilder.readthedocs.io/en/latest/
>> > >>> >
>> > >>> > I'd love to carve some time and lead this.
>> > >>> >
>> > >>> > On Fri, Nov 18, 2016 at 1:32 PM, Chris Riccomini <[email protected]> wrote:
>> > >>> >
>> > >>> >> Full-fledged REST API (that the UI also uses) would be great in 2.0.
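To make Max's FAB suggestion concrete, here is a minimal sketch of the pattern he describes: derive from FAB's `Model` instead of SQLAlchemy's declarative base and register a `ModelView`, which yields the can_* permissions and CRUD endpoints automatically. The `DagAnnotation` table and view are hypothetical names chosen purely for illustration; only the flask_appbuilder and sqlalchemy imports are real.

```python
from flask_appbuilder import Model, ModelView
from flask_appbuilder.models.sqla.interface import SQLAInterface
from sqlalchemy import Column, Integer, String

class DagAnnotation(Model):
    # Hypothetical table, only here to illustrate the FAB pattern.
    id = Column(Integer, primary_key=True)
    dag_id = Column(String(250), nullable=False)
    note = Column(String(500))

class DagAnnotationView(ModelView):
    # Registering this view auto-creates can_list/can_show/can_add/can_edit/
    # can_delete permissions and the corresponding CRUD endpoints.
    datamodel = SQLAInterface(DagAnnotation)
    list_columns = ["dag_id", "note"]

# In an app wired up with AppBuilder:
# appbuilder.add_view(DagAnnotationView, "DAG Annotations", category="Browse")
```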
>> > >>> >>
>> > >>> >> On Fri, Nov 18, 2016 at 6:26 AM, David Kegley <[email protected]> wrote:
>> > >>> >> > Hi All,
>> > >>> >> >
>> > >>> >> > We have been using Airflow heavily for the last couple of months and it’s been great so far. Here are a few things we’d like to see prioritized in 2.0.
>> > >>> >> >
>> > >>> >> > 1) Role-based access to DAGs:
>> > >>> >> > We would like to see better role-based access through the UI. There’s a related ticket out there but it hasn’t seen any action in a few months:
>> > >>> >> > https://issues.apache.org/jira/browse/AIRFLOW-85
>> > >>> >> >
>> > >>> >> > We use a templating system to create/deploy DAGs dynamically based on some directory/file structure. This allows analysts to quickly deploy and schedule their ETL code without having to interact with the Airflow installation directly. It would be great if those same analysts could have access to their own DAGs in the UI so that they can clear DAG runs, mark success, etc., while keeping them away from our core ETL and other people's/organizations' DAGs. Some of this can be accomplished with ‘filter by owner’, but it doesn’t address the use case where a DAG can be maintained by multiple users in the same organization when they have separate Airflow user accounts.
>> > >>> >> >
>> > >>> >> > 2) An option to turn off backfill:
>> > >>> >> > https://issues.apache.org/jira/browse/AIRFLOW-558
>> > >>> >> > For cases where a DAG does an insert overwrite on a table every day. This might be a realistic option for the current version, but I just wanted to call attention to this feature request.
>> > >>> >> >
>> > >>> >> > Best,
>> > >>> >> > David
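David's templating setup is a common pattern worth spelling out for the wish list: generate DAGs at parse time from per-team config files and publish them via `globals()` so the scheduler discovers them. The sketch below is only an illustration under assumed names (a YAML config directory, an `owner` field that ties into ‘filter by owner’), not David's actual system.

```python
import os
from datetime import datetime

import yaml  # assumes PyYAML is available for the config files
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

CONFIG_DIR = os.path.join(os.path.dirname(__file__), "etl_configs")

def build_dag(cfg):
    # One DAG per config file; `owner` feeds the existing 'filter by owner'.
    dag = DAG(
        dag_id=cfg["dag_id"],
        default_args={"owner": cfg.get("owner", "airflow"),
                      "start_date": datetime(2016, 11, 1)},
        schedule_interval=cfg.get("schedule", "@daily"),
    )
    BashOperator(task_id="run_etl", bash_command=cfg["command"], dag=dag)
    return dag

for fname in os.listdir(CONFIG_DIR):
    if fname.endswith(".yaml"):
        with open(os.path.join(CONFIG_DIR, fname)) as f:
            cfg = yaml.safe_load(f)
        # Module-level names are how the DagBag picks up generated DAGs.
        globals()[cfg["dag_id"]] = build_dag(cfg)
```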
>> > >>> >> >
>> > >>> >> > On Nov 17, 2016, at 6:19 PM, Maxime Beauchemin <[email protected]> wrote:
>> > >>> >> >
>> > >>> >> > *This is a brainstorm email thread about Airflow 2.0!*
>> > >>> >> >
>> > >>> >> > I wanted to share some ideas around what I would like to do in Airflow 2.0 and would love to hear what others are thinking. I'll compile the ideas that are shared in this thread in a Wiki once the conversation fades.
>> > >>> >> >
>> > >>> >> > -------------------------------------------
>> > >>> >> >
>> > >>> >> > First idea, to get the conversation started:
>> > >>> >> >
>> > >>> >> > *Breaking down the package*
>> > >>> >> > `pip install airflow-common airflow-scheduler airflow-webserver airflow-operators-googlecloud ...`
>> > >>> >> >
>> > >>> >> > It seems to me like we're getting to a point where having different repositories and different packages would make things much easier in all sorts of ways. For instance, the web server is a lot less sensitive than the scheduler, and changes to operators should/could be deployed at will, independently from the main package. People could upgrade only certain packages in their environment when needed. Travis builds would be more targeted and take less time, ...
>> > >>> >> >
>> > >>> >> > Also, the whole current "extras_require" approach to optional dependencies (in setup.py) is kind of getting out of hand.
>> > >>> >> >
>> > >>> >> > Of course `pip install airflow` would bring in a collection of sub-packages similar in functionality to what it does now, perhaps without so many of the operators you probably don't need in your environment.
>> > >>> >> >
>> > >>> >> > The release process is the main pain point and the biggest risk for the project, and I feel like this is a solid solution to address it.
>> > >>> >> >
>> > >>> >> > Max
>>
>> --
>> Sergei
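For readers less familiar with the setup.py mechanism Max mentions, this is roughly what the optional-dependency ("extras") approach looks like; the package and dependency names are simplified placeholders, not Airflow's real setup.py. Every new integration adds another entry to this dict in the monolithic package, which is the maintenance burden the split into `airflow-common` / `airflow-scheduler` / `airflow-operators-*` sub-packages would remove.

```python
# Simplified sketch of the extras_require pattern, not Airflow's actual setup.py.
from setuptools import setup, find_packages

setup(
    name="airflow",
    version="2.0.0.dev0",
    packages=find_packages(),
    install_requires=["flask", "sqlalchemy", "celery"],
    extras_require={
        # Each optional integration is an "extra" users opt into, e.g.
        #   pip install airflow[gcp,postgres]
        "gcp": ["google-api-python-client"],
        "postgres": ["psycopg2"],
        "s3": ["boto"],
        # ...dozens more in practice, all released in lockstep today.
    },
)
```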
