With the announcement of AWS Batch (https://aws.amazon.com/batch/), and for my own selfish needs, I think it would be really great to support batch systems like AWS Batch, Slurm, and Torque as executors, potentially via an extension of the BashOperator. That said, the BashOperator might actually be flexible enough that a dedicated BatchOperator isn't needed.
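The executor idea above boils down to a submit-then-poll loop against a batch backend. Here is a minimal, hypothetical sketch of that pattern (this is not Airflow or AWS code; `submit_job`/`describe_job` and the `FakeBatchClient` are made-up stand-ins for whatever API a real AWS Batch, Slurm, or Torque wrapper would expose):

```python
# Hypothetical sketch of the submit-then-poll pattern a batch-system
# executor or operator might use. The client is injected so the sketch
# stays self-contained; a real implementation would wrap boto3, sbatch,
# or qsub here.
import itertools
import time


def run_batch_job(client, job_definition, command, poll_interval=0.0):
    """Submit a job to a batch backend and block until it finishes."""
    job_id = client.submit_job(job_definition, command)
    for _ in itertools.count():
        status = client.describe_job(job_id)
        if status == "FAILED":
            raise RuntimeError("job %s failed" % job_id)
        if status == "SUCCEEDED":
            return job_id
        time.sleep(poll_interval)


class FakeBatchClient:
    """Stand-in backend that reports success on the second poll."""

    def __init__(self):
        self.polls = 0

    def submit_job(self, job_definition, command):
        return "job-1"

    def describe_job(self, job_id):
        self.polls += 1
        return "SUCCEEDED" if self.polls >= 2 else "RUNNING"
```

Because the backend is injected, the same loop could back either a BashOperator-style wrapper or a dedicated executor.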
Brian

On Nov 24, 2016, at 7:40 AM, Maycock, Luke <[email protected]> wrote:

> Add FK to dag_run to the task_instance table on Postgres so that task_instances can be uniquely attributed to dag runs.

+1. Also, I believe XComs would need to be addressed in the same way at the same time; I have added a comment to that effect on https://issues.apache.org/jira/browse/AIRFLOW-642. I believe this would be implemented for all supported back-ends, not just PostgreSQL.

Cheers,
Luke Maycock
OLIVER WYMAN
www.oliverwyman.com

________________________________
From: Arunprasad Venkatraman <[email protected]>
Sent: 21 November 2016 18:16
To: [email protected]
Subject: Re: Airflow 2.0

> Add FK to dag_run to the task_instance table on Postgres so that task_instances can be uniquely attributed to dag runs.
> Ensure scheduler can be run continuously without needing restarts.
> Ensure scheduler can handle tens of thousands of active workflows.

+1. We are planning to run around 40,000 tasks a day using Airflow, and some of them are critical for giving quick feedback to developers. Currently, using the execution date to uniquely identify tasks does not work for us, since we mainly trigger DAGs (instead of running them on a schedule), and we collide with the 1-second granularity on several occasions. Having a task UUID, or associating dag_run with the task_instance table as suggested by Sergei, would help mitigate this issue for us and would make it easy for us to update task results too. We would be happy to start working on this if it makes sense.

Also, we are wondering whether any work has been done in the community to support multiple schedulers (or alternatives to MySQL/Postgres), because one scheduler does not scale well for us and we sometimes see slowdowns of up to a couple of minutes when there are many pending tasks.
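The collision Arunprasad describes can be made concrete with a short sketch (DAG and task names here are made up): a task-instance key based on an execution date with one-second granularity collides when two runs are triggered within the same second, while a key that includes a per-run identifier, such as a dag_run foreign key or a UUID, stays unique.

```python
# Two externally-triggered runs of the same DAG inside the same second:
# keying task instances by (dag_id, task_id, execution_date) collides,
# while keying by a per-run id does not.
import datetime
import uuid

now = datetime.datetime(2016, 11, 21, 18, 16, 0)

runs = [
    {"run_id": uuid.uuid4().hex, "execution_date": now.replace(microsecond=0)},
    {"run_id": uuid.uuid4().hex, "execution_date": now.replace(microsecond=0)},
]

date_keys = {("my_dag", "my_task", r["execution_date"]) for r in runs}
run_keys = {("my_dag", "my_task", r["run_id"]) for r in runs}

assert len(date_keys) == 1  # collision: both runs map to the same key
assert len(run_keys) == 2   # unique: the run id disambiguates them
```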
Thanks

On Mon, Nov 21, 2016 at 9:57 AM, Chris Riccomini <[email protected]> wrote:

> Ensure scheduler can be run continuously without needing restarts.

+1

On Mon, Nov 21, 2016 at 5:25 AM, David Batista <[email protected]> wrote:

A small request, which might be handy: the possibility to select multiple tasks and mark them as Success/Clear/etc. Allow the UI to select individual tasks (i.e., inside the Tree View) and then have a button to mark them as Success/Clear/etc.

On 21 November 2016 at 14:22, Sergei Iakhnin <[email protected]> wrote:

I've been running Airflow on 1,500 cores in the context of scientific workflows for the past year and a half. Features that would be important to me for 2.0:

- Add FK to dag_run to the task_instance table on Postgres so that task_instances can be uniquely attributed to dag runs.
- Ensure the scheduler can be run continuously without needing restarts. Right now it gets into some ill-determined bad state, forcing me to restart it every 20 minutes.
- Ensure the scheduler can handle tens of thousands of active workflows. Right now this results in extremely long scheduling times and inconsistent scheduling even at two thousand active workflows.
- Add more flexible task scheduling prioritization. The default prioritization is the opposite of the behaviour I want: I would prefer that downstream tasks always have higher priority than upstream tasks, so that entire workflows tend to complete sooner, rather than scheduling tasks from other workflows. Having a few scheduling prioritization strategies would be beneficial here.
- Provide better support for manually-triggered DAGs in the UI, e.g. by showing them as queued.
- Provide some resource management capabilities via something like slots that can be defined on workers and occupied by tasks.
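The prioritization strategy Sergei asks for in the list above could be sketched roughly as follows (a toy illustration, not Airflow's actual scheduler code): give each task a priority equal to its depth in the DAG, so downstream tasks of an in-flight workflow outrank the root tasks of other workflows and whole workflows tend to finish sooner.

```python
# Toy depth-based prioritization: deeper (more downstream) tasks get a
# higher priority number. `dag` maps each task to its downstream tasks.
def depth_priorities(dag):
    roots = set(dag) - {t for deps in dag.values() for t in deps}
    prio = {}
    frontier = [(r, 0) for r in roots]
    while frontier:
        task, depth = frontier.pop()
        if prio.get(task, -1) < depth:  # keep the longest path to the task
            prio[task] = depth
            frontier.extend((d, depth + 1) for d in dag.get(task, []))
    return prio


dag = {"extract": ["transform"], "transform": ["load"], "load": []}
assert depth_priorities(dag) == {"extract": 0, "transform": 1, "load": 2}
```

Offering a few such strategies side by side (depth-based, default, FIFO) is the "few scheduling prioritization strategies" the list suggests.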
Using Celery's concurrency parameter at the Airflow server level is too coarse-grained: it forces all workers to be the same and does not allow proper resource management when different workflow tasks have different resource requirements, thus hurting utilization (a worker could run 8 parallel tasks with a small memory footprint, but only 1 task with a large memory footprint, for instance).

With best regards,

Sergei.

On Mon, Nov 21, 2016 at 2:00 PM, Ryabchuk, Pavlo <[email protected]> wrote:

-1. We rely heavily on data profiling as a pipeline health monitoring tool.

-----Original Message-----
From: Chris Riccomini [mailto:[email protected]]
Sent: Saturday, November 19, 2016 1:57 AM
To: [email protected]
Subject: Re: Airflow 2.0

> RIP out the charting application and the data profiler

Yes please! +1

On Fri, Nov 18, 2016 at 2:41 PM, Maxime Beauchemin <[email protected]> wrote:

Another point that may be controversial for Airflow 2.0: rip out the charting application and the data profiler. Even though it's nice to have it there, it's just out of scope and has major security issues/implications. I'm not sure how popular it actually is. We may need to run a survey at some point around these kinds of questions.

Max

On Fri, Nov 18, 2016 at 2:39 PM, Maxime Beauchemin <[email protected]> wrote:

Using FAB's Model, we get pretty much all of that (REST API, auth/perms, CRUD) for free:
http://flask-appbuilder.readthedocs.io/en/latest/quickhowto.html?highlight=rest#exposed-methods

I'm pretty intimate with FAB since I use it (and contributed to it) for Superset/Caravel.
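The slots idea Sergei floats above could look something like this minimal sketch (hypothetical, not an Airflow API): a worker advertises a number of slots, each task declares how many it occupies, and the worker only accepts tasks that fit, so one large-memory task can take all 8 slots while small tasks take 1 each.

```python
# Hypothetical slot-based worker: capacity is expressed in slots rather
# than a uniform Celery concurrency, so heterogeneous tasks can coexist
# without over-committing the worker.
class Worker:
    def __init__(self, slots):
        self.free = slots
        self.running = []

    def try_accept(self, task, slots_needed):
        """Accept the task only if enough slots remain."""
        if slots_needed > self.free:
            return False
        self.free -= slots_needed
        self.running.append(task)
        return True


worker = Worker(slots=8)
assert worker.try_accept("big_memory_task", 8)  # fills the worker
assert not worker.try_accept("small_task", 1)   # no room left
```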
All that's needed is to derive FAB's model class instead of SQLAlchemy's model class (which FAB's model wraps and adds functionality to, and which is 100% compatible AFAICT).

Max

On Fri, Nov 18, 2016 at 2:07 PM, Chris Riccomini <[email protected]> wrote:

> It may be doable to run this as a different package `airflow-webserver`, an alternate UI at first, and to eventually rip the old UI out of the main package.

This is the same strategy that I was thinking of for AIRFLOW-85. You can build the new UI in parallel, and then delete the old one later. I really think that a REST interface should be a prerequisite to any large/new UI changes, though. Getting unified so that everything is driven through REST will be a big win.

On Fri, Nov 18, 2016 at 1:51 PM, Maxime Beauchemin <[email protected]> wrote:

A multi-tenant UI with composable roles on top of granular permissions. Migrating from Flask-Admin to Flask App Builder would be an easy-ish win (since they're both Flask). FAB provides a good authentication and permission model that ships out of the box with a REST API. It suffices to define FAB models (derivatives of SQLAlchemy's model) and you get a set of perms for the model (can_show, can_list, can_add, can_change, can_delete, ...) and a set of CRUD REST endpoints. It would also allow us to rip the authentication backend code out of Airflow and rely on FAB for that. Also, every single view gets permissions auto-created for it, and there are easy ways to define row-level-type filters based on user permissions. It may be doable to run this as a different package `airflow-webserver`, an alternate UI at first, and to eventually rip the old UI out of the main package.
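The per-model permission sets mentioned above can be illustrated roughly like this (a made-up sketch of the idea, not FAB's real internals): for each registered view, a named permission is auto-created per CRUD action, and roles are then composed from these names.

```python
# Rough illustration of per-view auto-created permissions in the style
# Max describes (can_show, can_list, ...). View names are illustrative.
CRUD_ACTIONS = ("show", "list", "add", "change", "delete")


def permissions_for(view_name):
    """Map each auto-created permission name to the view action it guards."""
    return {"can_%s" % a: "%s.%s" % (view_name, a) for a in CRUD_ACTIONS}


perms = permissions_for("DagModelView")
assert set(perms) == {"can_show", "can_list", "can_add", "can_change", "can_delete"}
```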
http://flask-appbuilder.readthedocs.io/en/latest/

I'd love to carve out some time and lead this.

On Fri, Nov 18, 2016 at 1:32 PM, Chris Riccomini <[email protected]> wrote:

A full-fledged REST API (that the UI also uses) would be great in 2.0.

On Fri, Nov 18, 2016 at 6:26 AM, David Kegley <[email protected]> wrote:

Hi All,

We have been using Airflow heavily for the last couple of months and it's been great so far. Here are a few things we'd like to see prioritized in 2.0.

1) Role-based access to DAGs: We would like to see better role-based access through the UI. There's a related ticket out there, but it hasn't seen any action in a few months: https://issues.apache.org/jira/browse/AIRFLOW-85

We use a templating system to create/deploy DAGs dynamically based on some directory/file structure. This allows analysts to quickly deploy and schedule their ETL code without having to interact with the Airflow installation directly. It would be great if those same analysts could access their own DAGs in the UI so that they can clear DAG runs, mark success, etc., while keeping them away from our core ETL and other people's/organizations' DAGs. Some of this can be accomplished with 'filter by owner', but it doesn't address the use case where a DAG can be maintained by multiple users in the same organization when they have separate Airflow user accounts.
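The multi-maintainer gap in 'filter by owner' described above could be closed by something like the following hypothetical sketch (names are made up, this is not an Airflow feature): instead of one owner string per DAG, each DAG lists the groups that maintain it, and a user sees a DAG if any of their groups match.

```python
# Hypothetical group-based DAG visibility: a DAG maintained by several
# users is visible to everyone in any of its maintainer groups, which
# 'filter by owner' (one owner string) cannot express.
dag_groups = {
    "core_etl": {"data-platform"},
    "analyst_report": {"analytics", "data-platform"},
}


def visible_dags(user_groups):
    """Return the DAGs whose maintainer groups intersect the user's groups."""
    return sorted(d for d, groups in dag_groups.items() if groups & user_groups)


assert visible_dags({"analytics"}) == ["analyst_report"]
assert visible_dags({"data-platform"}) == ["analyst_report", "core_etl"]
```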
2) An option to turn off backfill: https://issues.apache.org/jira/browse/AIRFLOW-558

For cases where a DAG does an insert overwrite on a table every day. This might be a realistic option for the current version, but I just wanted to call attention to this feature request.

Best,
David

On Nov 17, 2016, at 6:19 PM, Maxime Beauchemin <[email protected]> wrote:

*This is a brainstorm email thread about Airflow 2.0!*

I wanted to share some ideas around what I would like to do in Airflow 2.0 and would love to hear what others are thinking. I'll compile the ideas that are shared in this thread in a wiki once the conversation fades.

-------------------------------------------

First idea, to get the conversation started: *breaking down the package*

`pip install airflow-common airflow-scheduler airflow-webserver airflow-operators-googlecloud ...`

It seems to me like we're getting to a point where having different repositories and different packages would make things much easier in all sorts of ways. For instance, the web server is a lot less sensitive than the scheduler, and changes to operators should/could be deployed at will, independently from the main package. People could upgrade only certain packages in their environment when needed, Travis builds would be more targeted and take less time, and so on. Also, the whole current "extras_require" approach to optional dependencies (in setup.py) is kind of getting out of hand. Of course, `pip install airflow` would bring in a collection of sub-packages similar in functionality to what it does now, perhaps without so many operators you probably don't need in your environment.
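The packaging split sketched above could look something like this in a setup.py: a thin `airflow` meta-package whose install_requires pulls in a default set of sub-packages, with operator bundles as optional extras. (The package names here are illustrative, echoing the ones in the thread, not real PyPI distributions.)

```python
# Sketch of a meta-package setup.py for the proposed split. The commented
# setup() call shows where these would plug into setuptools.
INSTALL_REQUIRES = [
    "airflow-common",
    "airflow-scheduler",
    "airflow-webserver",
]

EXTRAS_REQUIRE = {
    # Operator bundles stay optional, replacing the sprawling
    # extras_require list with one extra per sub-package.
    "googlecloud": ["airflow-operators-googlecloud"],
    "all": ["airflow-operators-googlecloud"],
}

# from setuptools import setup
# setup(name="airflow", install_requires=INSTALL_REQUIRES,
#       extras_require=EXTRAS_REQUIRE)
```

With this shape, `pip install airflow` keeps today's default behavior while `pip install airflow[googlecloud]` opts into an operator bundle.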
The release process is the main pain point and the biggest risk for the project, and I feel like this is a solid solution to address it.

Max

--
Sergei

--
*David Batista*
*Data Engineer, HelloFresh Global*
Saarbrücker Str. 37a | 10405 Berlin
[email protected]
