With the announcement of AWS Batch (https://aws.amazon.com/batch/), and for my own selfish needs, I think it would be really great to support batch systems like AWS Batch, Slurm, and Torque as executors, potentially via an extension of the BashOperator. That said, the BashOperator might actually be flexible enough that a dedicated BatchOperator isn't needed.
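The executor idea above boils down to a submit-then-poll loop against a batch backend. Here is a minimal, hypothetical sketch of that pattern (this is not Airflow or AWS code; `submit_job`/`describe_job` and the `FakeBatchClient` are made-up stand-ins for whatever API a real AWS Batch, Slurm, or Torque wrapper would expose):

```python
# Hypothetical sketch of the submit-then-poll pattern a batch-system
# executor or operator might use. The client is injected so the sketch
# stays self-contained; a real implementation would wrap boto3, sbatch,
# or qsub here.
import itertools
import time


def run_batch_job(client, job_definition, command, poll_interval=0.0):
    """Submit a job to a batch backend and block until it finishes."""
    job_id = client.submit_job(job_definition, command)
    for _ in itertools.count():
        status = client.describe_job(job_id)
        if status == "FAILED":
            raise RuntimeError("job %s failed" % job_id)
        if status == "SUCCEEDED":
            return job_id
        time.sleep(poll_interval)


class FakeBatchClient:
    """Stand-in backend that reports success on the second poll."""

    def __init__(self):
        self.polls = 0

    def submit_job(self, job_definition, command):
        return "job-1"

    def describe_job(self, job_id):
        self.polls += 1
        return "SUCCEEDED" if self.polls >= 2 else "RUNNING"
```

Because the backend is injected, the same loop could back either a BashOperator-style wrapper or a dedicated executor.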
Brian

On Nov 24, 2016, at 7:40 AM, Maycock, Luke <[email protected]> wrote:

> Add FK to dag_run to the task_instance table on Postgres so that task_instances can be uniquely attributed to dag runs.

+1. Also, I believe XComs would need to be addressed in the same way at the same time; I have added a comment to that effect on https://issues.apache.org/jira/browse/AIRFLOW-642. I believe this would be implemented for all supported back-ends, not just PostgreSQL.

Cheers,
Luke Maycock
OLIVER WYMAN
www.oliverwyman.com

________________________________
From: Arunprasad Venkatraman <[email protected]>
Sent: 21 November 2016 18:16
To: [email protected]
Subject: Re: Airflow 2.0

> Add FK to dag_run to the task_instance table on Postgres so that task_instances can be uniquely attributed to dag runs.
> Ensure scheduler can be run continuously without needing restarts.
> Ensure scheduler can handle tens of thousands of active workflows.

+1. We are planning to run around 40,000 tasks a day using Airflow, and some of them are critical for giving quick feedback to developers. Currently, using the execution date to uniquely identify tasks does not work for us, since we mainly trigger DAGs (instead of running them on a schedule), and we collide with the 1-second granularity on several occasions. Having a task UUID, or associating dag_run with the task_instance table as suggested by Sergei, would help mitigate this issue for us and would make it easy for us to update task results too. We would be happy to start working on this if it makes sense.

Also, we are wondering whether any work has been done in the community to support multiple schedulers (or alternatives to MySQL/Postgres), because one scheduler does not scale well for us and we sometimes see slowdowns of up to a couple of minutes when there are many pending tasks.
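The collision Arunprasad describes can be made concrete with a short sketch (DAG and task names here are made up): a task-instance key based on an execution date with one-second granularity collides when two runs are triggered within the same second, while a key that includes a per-run identifier, such as a dag_run foreign key or a UUID, stays unique.

```python
# Two externally-triggered runs of the same DAG inside the same second:
# keying task instances by (dag_id, task_id, execution_date) collides,
# while keying by a per-run id does not.
import datetime
import uuid

now = datetime.datetime(2016, 11, 21, 18, 16, 0)

runs = [
    {"run_id": uuid.uuid4().hex, "execution_date": now.replace(microsecond=0)},
    {"run_id": uuid.uuid4().hex, "execution_date": now.replace(microsecond=0)},
]

date_keys = {("my_dag", "my_task", r["execution_date"]) for r in runs}
run_keys = {("my_dag", "my_task", r["run_id"]) for r in runs}

assert len(date_keys) == 1  # collision: both runs map to the same key
assert len(run_keys) == 2   # unique: the run id disambiguates them
```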
Thanks

On Mon, Nov 21, 2016 at 9:57 AM, Chris Riccomini <[email protected]> wrote:

> Ensure scheduler can be run continuously without needing restarts.

+1

On Mon, Nov 21, 2016 at 5:25 AM, David Batista <[email protected]> wrote:

A small request, which might be handy: the possibility to select multiple tasks and mark them as Success/Clear/etc. Allow the UI to select individual tasks (i.e., inside the Tree View) and then have a button to mark them as Success/Clear/etc.

On 21 November 2016 at 14:22, Sergei Iakhnin <[email protected]> wrote:

I've been running Airflow on 1,500 cores in the context of scientific workflows for the past year and a half. Features that would be important to me for 2.0:

- Add FK to dag_run to the task_instance table on Postgres so that task_instances can be uniquely attributed to dag runs.
- Ensure the scheduler can be run continuously without needing restarts. Right now it gets into some ill-determined bad state, forcing me to restart it every 20 minutes.
- Ensure the scheduler can handle tens of thousands of active workflows. Right now this results in extremely long scheduling times and inconsistent scheduling even at two thousand active workflows.
- Add more flexible task scheduling prioritization. The default prioritization is the opposite of the behaviour I want: I would prefer that downstream tasks always have higher priority than upstream tasks, so that entire workflows tend to complete sooner, rather than scheduling tasks from other workflows. Having a few scheduling prioritization strategies would be beneficial here.
- Provide better support for manually-triggered DAGs in the UI, e.g. by showing them as queued.
- Provide some resource management capabilities via something like slots that can be defined on workers and occupied by tasks.
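The prioritization strategy Sergei asks for in the list above could be sketched roughly as follows (a toy illustration, not Airflow's actual scheduler code): give each task a priority equal to its depth in the DAG, so downstream tasks of an in-flight workflow outrank the root tasks of other workflows and whole workflows tend to finish sooner.

```python
# Toy depth-based prioritization: deeper (more downstream) tasks get a
# higher priority number. `dag` maps each task to its downstream tasks.
def depth_priorities(dag):
    roots = set(dag) - {t for deps in dag.values() for t in deps}
    prio = {}
    frontier = [(r, 0) for r in roots]
    while frontier:
        task, depth = frontier.pop()
        if prio.get(task, -1) < depth:  # keep the longest path to the task
            prio[task] = depth
            frontier.extend((d, depth + 1) for d in dag.get(task, []))
    return prio


dag = {"extract": ["transform"], "transform": ["load"], "load": []}
assert depth_priorities(dag) == {"extract": 0, "transform": 1, "load": 2}
```

Offering a few such strategies side by side (depth-based, default, FIFO) is the "few scheduling prioritization strategies" the list suggests.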
Using Celery's concurrency parameter at the Airflow server level is too coarse-grained: it forces all workers to be the same and does not allow proper resource management when different workflow tasks have different resource requirements, thus hurting utilization (a worker could run 8 parallel tasks with a small memory footprint, but only 1 task with a large memory footprint, for instance).

With best regards,

Sergei.

On Mon, Nov 21, 2016 at 2:00 PM, Ryabchuk, Pavlo <[email protected]> wrote:

-1. We rely heavily on data profiling as a pipeline health monitoring tool.

-----Original Message-----
From: Chris Riccomini [mailto:[email protected]]
Sent: Saturday, November 19, 2016 1:57 AM
To: [email protected]
Subject: Re: Airflow 2.0

> RIP out the charting application and the data profiler

Yes please! +1

On Fri, Nov 18, 2016 at 2:41 PM, Maxime Beauchemin <[email protected]> wrote:

Another point that may be controversial for Airflow 2.0: rip out the charting application and the data profiler. Even though it's nice to have it there, it's just out of scope and has major security issues/implications. I'm not sure how popular it actually is. We may need to run a survey at some point around these kinds of questions.

Max

On Fri, Nov 18, 2016 at 2:39 PM, Maxime Beauchemin <[email protected]> wrote:

Using FAB's Model, we get pretty much all of that (REST API, auth/perms, CRUD) for free:
http://flask-appbuilder.readthedocs.io/en/latest/quickhowto.html?highlight=rest#exposed-methods

I'm pretty intimate with FAB since I use it (and contributed to it) for Superset/Caravel.
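The slots idea Sergei floats above could look something like this minimal sketch (hypothetical, not an Airflow API): a worker advertises a number of slots, each task declares how many it occupies, and the worker only accepts tasks that fit, so one large-memory task can take all 8 slots while small tasks take 1 each.

```python
# Hypothetical slot-based worker: capacity is expressed in slots rather
# than a uniform Celery concurrency, so heterogeneous tasks can coexist
# without over-committing the worker.
class Worker:
    def __init__(self, slots):
        self.free = slots
        self.running = []

    def try_accept(self, task, slots_needed):
        """Accept the task only if enough slots remain."""
        if slots_needed > self.free:
            return False
        self.free -= slots_needed
        self.running.append(task)
        return True


worker = Worker(slots=8)
assert worker.try_accept("big_memory_task", 8)  # fills the worker
assert not worker.try_accept("small_task", 1)   # no room left
```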
All that's needed is to derive FAB's model class instead of SQLAlchemy's model class (which FAB's model wraps and adds functionality to, and which is 100% compatible AFAICT).

Max

On Fri, Nov 18, 2016 at 2:07 PM, Chris Riccomini <[email protected]> wrote:

> It may be doable to run this as a different package `airflow-webserver`, an alternate UI at first, and to eventually rip the old UI out of the main package.

This is the same strategy that I was thinking of for AIRFLOW-85. You can build the new UI in parallel, and then delete the old one later. I really think that a REST interface should be a prerequisite to any large/new UI changes, though. Getting unified so that everything is driven through REST will be a big win.

On Fri, Nov 18, 2016 at 1:51 PM, Maxime Beauchemin <[email protected]> wrote:

A multi-tenant UI with composable roles on top of granular permissions. Migrating from Flask-Admin to Flask App Builder would be an easy-ish win (since they're both Flask). FAB provides a good authentication and permission model that ships out of the box with a REST API. It suffices to define FAB models (derivatives of SQLAlchemy's model) and you get a set of perms for the model (can_show, can_list, can_add, can_change, can_delete, ...) and a set of CRUD REST endpoints. It would also allow us to rip the authentication backend code out of Airflow and rely on FAB for that. Also, every single view gets permissions auto-created for it, and there are easy ways to define row-level-type filters based on user permissions. It may be doable to run this as a different package `airflow-webserver`, an alternate UI at first, and to eventually rip the old UI out of the main package.
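The per-model permission sets mentioned above can be illustrated roughly like this (a made-up sketch of the idea, not FAB's real internals): for each registered view, a named permission is auto-created per CRUD action, and roles are then composed from these names.

```python
# Rough illustration of per-view auto-created permissions in the style
# Max describes (can_show, can_list, ...). View names are illustrative.
CRUD_ACTIONS = ("show", "list", "add", "change", "delete")


def permissions_for(view_name):
    """Map each auto-created permission name to the view action it guards."""
    return {"can_%s" % a: "%s.%s" % (view_name, a) for a in CRUD_ACTIONS}


perms = permissions_for("DagModelView")
assert set(perms) == {"can_show", "can_list", "can_add", "can_change", "can_delete"}
```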
http://flask-appbuilder.readthedocs.io/en/latest/

I'd love to carve out some time and lead this.

On Fri, Nov 18, 2016 at 1:32 PM, Chris Riccomini <[email protected]> wrote:

A full-fledged REST API (that the UI also uses) would be great in 2.0.

On Fri, Nov 18, 2016 at 6:26 AM, David Kegley <[email protected]> wrote:

Hi All,

We have been using Airflow heavily for the last couple of months and it's been great so far. Here are a few things we'd like to see prioritized in 2.0.

1) Role-based access to DAGs: We would like to see better role-based access through the UI. There's a related ticket out there, but it hasn't seen any action in a few months: https://issues.apache.org/jira/browse/AIRFLOW-85

We use a templating system to create/deploy DAGs dynamically based on some directory/file structure. This allows analysts to quickly deploy and schedule their ETL code without having to interact with the Airflow installation directly. It would be great if those same analysts could access their own DAGs in the UI so that they can clear DAG runs, mark success, etc., while keeping them away from our core ETL and other people's/organizations' DAGs. Some of this can be accomplished with 'filter by owner', but it doesn't address the use case where a DAG can be maintained by multiple users in the same organization when they have separate Airflow user accounts.
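The multi-maintainer gap in 'filter by owner' described above could be closed by something like the following hypothetical sketch (names are made up, this is not an Airflow feature): instead of one owner string per DAG, each DAG lists the groups that maintain it, and a user sees a DAG if any of their groups match.

```python
# Hypothetical group-based DAG visibility: a DAG maintained by several
# users is visible to everyone in any of its maintainer groups, which
# 'filter by owner' (one owner string) cannot express.
dag_groups = {
    "core_etl": {"data-platform"},
    "analyst_report": {"analytics", "data-platform"},
}


def visible_dags(user_groups):
    """Return the DAGs whose maintainer groups intersect the user's groups."""
    return sorted(d for d, groups in dag_groups.items() if groups & user_groups)


assert visible_dags({"analytics"}) == ["analyst_report"]
assert visible_dags({"data-platform"}) == ["analyst_report", "core_etl"]
```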
2) An option to turn off backfill: https://issues.apache.org/jira/browse/AIRFLOW-558

For cases where a DAG does an insert overwrite on a table every day. This might be a realistic option for the current version, but I just wanted to call attention to this feature request.

Best,
David

On Nov 17, 2016, at 6:19 PM, Maxime Beauchemin <[email protected]> wrote:

*This is a brainstorm email thread about Airflow 2.0!*

I wanted to share some ideas around what I would like to do in Airflow 2.0 and would love to hear what others are thinking. I'll compile the ideas that are shared in this thread in a wiki once the conversation fades.

-------------------------------------------

First idea, to get the conversation started: *breaking down the package*

`pip install airflow-common airflow-scheduler airflow-webserver airflow-operators-googlecloud ...`

It seems to me like we're getting to a point where having different repositories and different packages would make things much easier in all sorts of ways. For instance, the web server is a lot less sensitive than the scheduler, and changes to operators should/could be deployed at will, independently from the main package. People could upgrade only certain packages in their environment when needed, Travis builds would be more targeted and take less time, and so on. Also, the whole current "extras_require" approach to optional dependencies (in setup.py) is kind of getting out of hand. Of course, `pip install airflow` would bring in a collection of sub-packages similar in functionality to what it does now, perhaps without so many operators you probably don't need in your environment.
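The packaging split sketched above could look something like this in a setup.py: a thin `airflow` meta-package whose install_requires pulls in a default set of sub-packages, with operator bundles as optional extras. (The package names here are illustrative, echoing the ones in the thread, not real PyPI distributions.)

```python
# Sketch of a meta-package setup.py for the proposed split. The commented
# setup() call shows where these would plug into setuptools.
INSTALL_REQUIRES = [
    "airflow-common",
    "airflow-scheduler",
    "airflow-webserver",
]

EXTRAS_REQUIRE = {
    # Operator bundles stay optional, replacing the sprawling
    # extras_require list with one extra per sub-package.
    "googlecloud": ["airflow-operators-googlecloud"],
    "all": ["airflow-operators-googlecloud"],
}

# from setuptools import setup
# setup(name="airflow", install_requires=INSTALL_REQUIRES,
#       extras_require=EXTRAS_REQUIRE)
```

With this shape, `pip install airflow` keeps today's default behavior while `pip install airflow[googlecloud]` opts into an operator bundle.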
The release process is the main pain point and the biggest risk for the project, and I feel like this is a solid solution to address it.

Max

--
Sergei

--
*David Batista*
*Data Engineer, HelloFresh Global*
Saarbrücker Str. 37a | 10405 Berlin
[email protected]
