> Add FK to dag_run to the task_instance table on Postgres so that
> task_instances can be uniquely attributed to dag runs.
> Ensure scheduler can be run continuously without needing restarts.
> Ensure scheduler can handle tens of thousands of active workflows.

+1

We are planning to run around 40,000 tasks a day using Airflow, and some of
them are critical for giving quick feedback to developers. Using the
execution date to uniquely identify tasks does not work for us, since we
mainly trigger DAGs (instead of running them on a schedule) and have
collided on the 1-second granularity on several occasions. Having a task
UUID, or associating dag_run with the task_instance table as Sergei
suggested, would mitigate this issue for us and would also make it easy for
us to update task results. We would be happy to start working on this if it
makes sense.
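
To make the collision concrete, here is a minimal sketch of the idea, not
Airflow's actual schema. The `dag_run` and `task_instance` tables, their
columns, and the `make_db` and `trigger` helpers are all hypothetical, and
Python's built-in sqlite3 stands in for Postgres so the example is
self-contained. Two runs triggered within the same second end up with the
same execution date, but their task instances stay distinct because they are
keyed by a foreign key to the run rather than by the date.

```python
import sqlite3
import uuid
from datetime import datetime


def make_db():
    """Create an in-memory database with a hypothetical, simplified schema."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(
        """
        CREATE TABLE dag_run (
            id TEXT PRIMARY KEY,           -- UUID, unique per run
            dag_id TEXT NOT NULL,
            execution_date TEXT NOT NULL   -- 1-second granularity, may repeat
        );
        CREATE TABLE task_instance (
            task_id TEXT NOT NULL,
            dag_run_id TEXT NOT NULL REFERENCES dag_run(id),
            state TEXT,
            PRIMARY KEY (task_id, dag_run_id)
        );
        """
    )
    return conn


def trigger(conn, dag_id, when, task_ids):
    """Record a manually triggered run and queue its task instances."""
    run_id = str(uuid.uuid4())
    # Truncate to whole seconds to mimic the granularity discussed above.
    execution_date = when.replace(microsecond=0).isoformat()
    conn.execute(
        "INSERT INTO dag_run VALUES (?, ?, ?)",
        (run_id, dag_id, execution_date),
    )
    conn.executemany(
        "INSERT INTO task_instance VALUES (?, ?, 'queued')",
        [(task_id, run_id) for task_id in task_ids],
    )
    return run_id
```

Calling `trigger` twice with timestamps inside the same second yields two
different run ids and two separate `task_instance` rows for the same task,
whereas a key built from the execution date alone would have rejected the
second run.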

We are also wondering whether any work has been done in the community to
support multiple schedulers (or alternatives to MySQL/Postgres), because a
single scheduler does not scale well for us: we sometimes see slowdowns of up
to a couple of minutes when there are several pending tasks.

Thanks



On Mon, Nov 21, 2016 at 9:57 AM, Chris Riccomini <criccom...@apache.org>
wrote:

> > Ensure scheduler can be run continuously without needing restarts
>
> +1
>
> On Mon, Nov 21, 2016 at 5:25 AM, David Batista <d...@hellofresh.com> wrote:
> > A small request, which might be handy.
> >
> > Having the possibility to select multiple tasks and mark them as
> > Success/Clear/etc.
> >
> > Allow the UI to select individual tasks (i.e., inside the Tree View) and
> > then have a button to mark them as Success/Clear/etc.
> >
> > On 21 November 2016 at 14:22, Sergei Iakhnin <lle...@gmail.com> wrote:
> >
> >> I've been running Airflow on 1500 cores in the context of scientific
> >> workflows for the past year and a half. Features that would be
> >> important to me for 2.0:
> >>
> >> - Add FK to dag_run to the task_instance table on Postgres so that
> >> task_instances can be uniquely attributed to dag runs.
> >> - Ensure scheduler can be run continuously without needing restarts.
> >> Right now it gets into some ill-determined bad state forcing me to
> >> restart it every 20 minutes.
> >> - Ensure scheduler can handle tens of thousands of active workflows.
> >> Right now this results in extremely long scheduling times and
> >> inconsistent scheduling even at 2 thousand active workflows.
> >> - Add more flexible task scheduling prioritization. The default
> >> prioritization is the opposite of the behaviour I want. I would prefer
> >> that downstream tasks always have higher priority than upstream tasks
> >> to cause entire workflows to tend to complete sooner, rather than
> >> scheduling tasks from other workflows. Having a few scheduling
> >> prioritization strategies would be beneficial here.
> >> - Provide better support for manually-triggered DAGs on the UI i.e. by
> >> showing them as queued.
> >> - Provide some resource management capabilities via something like slots
> >> that can be defined on workers and occupied by tasks. Using celery's
> >> concurrency parameter at the airflow server level is too coarse-grained
> >> as it forces all workers to be the same, and does not allow proper
> >> resource management when different workflow tasks have different
> >> resource requirements thus hurting utilization (a worker could run 8
> >> parallel tasks with small memory footprint, but only 1 task with large
> >> memory footprint for instance).
> >>
> >> With best regards,
> >>
> >> Sergei.
> >>
> >>
> >> On Mon, Nov 21, 2016 at 2:00 PM Ryabchuk, Pavlo <
> >> ext-pavlo.ryabc...@here.com>
> >> wrote:
> >>
> >> > -1. We rely heavily on data profiling as a pipeline health monitoring
> >> > tool.
> >> >
> >> > -----Original Message-----
> >> > From: Chris Riccomini [mailto:criccom...@apache.org]
> >> > Sent: Saturday, November 19, 2016 1:57 AM
> >> > To: dev@airflow.incubator.apache.org
> >> > Subject: Re: Airflow 2.0
> >> >
> >> > > RIP out the charting application and the data profiler
> >> >
> >> > Yes please! +1
> >> >
> >> > On Fri, Nov 18, 2016 at 2:41 PM, Maxime Beauchemin <
> >> > maximebeauche...@gmail.com> wrote:
> >> > > Another point that may be controversial for Airflow 2.0: RIP out the
> >> > > charting application and the data profiler. Even though it's nice to
> >> > > have it there, it's just out of scope and has major security
> >> > > issues/implications.
> >> > >
> >> > > I'm not sure how popular it actually is. We may need to run a survey
> >> > > at some point around these kinds of questions.
> >> > >
> >> > > Max
> >> > >
> >> > > On Fri, Nov 18, 2016 at 2:39 PM, Maxime Beauchemin <
> >> > > maximebeauche...@gmail.com> wrote:
> >> > >
> >> > >> Using FAB's Model, we get pretty much all of that (REST API,
> >> > >> auth/perms, CRUD) for free:
> >> > >> http://flask-appbuilder.readthedocs.io/en/latest/quickhowto.html?highlight=rest#exposed-methods
> >> > >>
> >> > >> I'm pretty intimate with FAB since I use it (and contributed to it)
> >> > >> for Superset/Caravel.
> >> > >>
> >> > >> All that's needed is to derive FAB's model class instead of
> >> > >> SqlAlchemy's model class (which FAB's model wraps and adds
> >> > >> functionality to and is 100% compatible AFAICT).
> >> > >>
> >> > >> Max
> >> > >>
> >> > >> On Fri, Nov 18, 2016 at 2:07 PM, Chris Riccomini
> >> > >> <criccom...@apache.org>
> >> > >> wrote:
> >> > >>
> >> > >>> > It may be doable to run this as a different package
> >> > >>> > `airflow-webserver`, an alternate UI at first, and to eventually
> >> > >>> > rip out the old UI off of the main package.
> >> > >>>
> >> > >>> This is the same strategy that I was thinking of for AIRFLOW-85. You
> >> > >>> can build the new UI in parallel, and then delete the old one later.
> >> > >>> I really think that a REST interface should be a pre-req to any
> >> > >>> large/new UI changes, though. Getting unified so that everything is
> >> > >>> driven through REST will be a big win.
> >> > >>>
> >> > >>> On Fri, Nov 18, 2016 at 1:51 PM, Maxime Beauchemin
> >> > >>> <maximebeauche...@gmail.com> wrote:
> >> > >>> > A multi-tenant UI with composable roles on top of granular
> >> > >>> > permissions.
> >> > >>> >
> >> > >>> > Migrating from Flask-Admin to Flask App Builder would be an
> >> > >>> > easy-ish win (since they're both Flask). FAB provides a good
> >> > >>> > authentication and permission model that ships out-of-the-box with
> >> > >>> > a REST API. It suffices to define FAB models (derivatives of
> >> > >>> > SQLAlchemy's model) and you get a set of perms for the model
> >> > >>> > (can_show, can_list, can_add, can_change, can_delete, ...) and a
> >> > >>> > set of CRUD REST endpoints. It would also allow us to rip the
> >> > >>> > authentication backend code out of Airflow and rely on FAB for
> >> > >>> > that. Also every single view gets permissions auto-created for it,
> >> > >>> > and there are easy ways to define row-level type filters based on
> >> > >>> > user permissions.
> >> > >>> >
> >> > >>> > It may be doable to run this as a different package
> >> > >>> > `airflow-webserver`, an alternate UI at first, and to eventually
> >> > >>> > rip out the old UI off of the main package.
> >> > >>> >
> >> > >>> > https://flask-appbuilder.readthedocs.io/en/latest/
> >> > >>> >
> >> > >>> > I'd love to carve some time and lead this.
> >> > >>> >
> >> > >>> > On Fri, Nov 18, 2016 at 1:32 PM, Chris Riccomini
> >> > >>> > <criccom...@apache.org> wrote:
> >> > >>> >
> >> > >>> >> Full-fledged REST API (that the UI also uses) would be great in
> >> > >>> >> 2.0.
> >> > >>> >>
> >> > >>> >> On Fri, Nov 18, 2016 at 6:26 AM, David Kegley <k...@b23.io>
> >> > >>> >> wrote:
> >> > >>> >> > Hi All,
> >> > >>> >> >
> >> > >>> >> > We have been using Airflow heavily for the last couple months
> >> > >>> >> > and it’s been great so far. Here are a few things we’d like to
> >> > >>> >> > see prioritized in 2.0.
> >> > >>> >> >
> >> > >>> >> > 1) Role based access to DAGs:
> >> > >>> >> > We would like to see better role based access through the UI.
> >> > >>> >> > There’s a related ticket out there but it hasn’t seen any action
> >> > >>> >> > in a few months:
> >> > >>> >> > https://issues.apache.org/jira/browse/AIRFLOW-85
> >> > >>> >> >
> >> > >>> >> > We use a templating system to create/deploy DAGs dynamically
> >> > >>> >> > based on some directory/file structure. This allows analysts to
> >> > >>> >> > quickly deploy and schedule their ETL code without having to
> >> > >>> >> > interact with the Airflow installation directly. It would be
> >> > >>> >> > great if those same analysts could access their own DAGs in the
> >> > >>> >> > UI so that they can clear DAG runs, mark success, etc. while
> >> > >>> >> > keeping them away from our core ETL and other
> >> > >>> >> > people's/organization's DAGs. Some of this can be accomplished
> >> > >>> >> > with ‘filter by owner’ but it doesn’t address the use case where
> >> > >>> >> > a DAG can be maintained by multiple users in the same
> >> > >>> >> > organization when they have separate Airflow user accounts.
> >> > >>> >> >
> >> > >>> >> > 2) An option to turn off backfill:
> >> > >>> >> > https://issues.apache.org/jira/browse/AIRFLOW-558
> >> > >>> >> > For cases where a DAG does an insert overwrite on a table every
> >> > >>> >> > day. This might be a realistic option for the current version
> >> > >>> >> > but I just wanted to call attention to this feature request.
> >> > >>> >> >
> >> > >>> >> > Best,
> >> > >>> >> > David
> >> > >>> >> >
> >> > >>> >> > On Nov 17, 2016, at 6:19 PM, Maxime Beauchemin
> >> > >>> >> > <maximebeauche...@gmail.com> wrote:
> >> > >>> >> >
> >> > >>> >> > *This is a brainstorm email thread about Airflow 2.0!*
> >> > >>> >> >
> >> > >>> >> > I wanted to share some ideas around what I would like to do in
> >> > >>> >> > Airflow 2.0 and would love to hear what others are thinking.
> >> > >>> >> > I'll compile the ideas that are shared in this thread in a Wiki
> >> > >>> >> > once the conversation fades.
> >> > >>> >> >
> >> > >>> >> > -------------------------------------------
> >> > >>> >> >
> >> > >>> >> > First idea, to get the conversation started:
> >> > >>> >> >
> >> > >>> >> > *Breaking down the package*
> >> > >>> >> > `pip install airflow-common airflow-scheduler airflow-webserver
> >> > >>> >> > airflow-operators-googlecloud ...`
> >> > >>> >> >
> >> > >>> >> > It seems to me like we're getting to a point where having
> >> > >>> >> > different repositories and different packages would make things
> >> > >>> >> > much easier in all sorts of ways. For instance the web server is
> >> > >>> >> > a lot less sensitive than the scheduler, and changes to
> >> > >>> >> > operators should/could be deployed at will, independently from
> >> > >>> >> > the main package. People in their environment could upgrade only
> >> > >>> >> > certain packages when needed. Travis builds would be more
> >> > >>> >> > targeted, and take less time, ...
> >> > >>> >> >
> >> > >>> >> > Also, the whole current "extras_require" approach to optional
> >> > >>> >> > dependencies (in setup.py) is kind of getting out of hand.
> >> > >>> >> >
> >> > >>> >> > Of course `pip install airflow` would bring in a collection of
> >> > >>> >> > sub-packages similar in functionality to what it does now,
> >> > >>> >> > perhaps without so many operators you probably don't need in
> >> > >>> >> > your environment.
> >> > >>> >> >
> >> > >>> >> > The release process is the main pain-point and the biggest risk
> >> > >>> >> > for the project, and I feel like this is a solid solution to
> >> > >>> >> > address it.
> >> > >>> >> > Max
> >> > >>> >> >
> >> > >>> >>
> >> > >>>
> >> > >>
> >> > >>
> >> >
> >> --
> >>
> >> Sergei
> >>
> >
> >
> >
> > --
> > *David Batista* *Data Engineer**, HelloFresh Global*
> > Saarbrücker Str. 37a | 10405 Berlin
> > d...@hellofresh.com <em...@hellofresh.com>
> >
>
