Re: Single Airflow Instance Vs Multiple Airflow Instance

2018-06-07 Thread Ananth Durai
At Slack, we follow a similar pattern of deploying multiple Airflow
instances. Since the Airflow UI and the scheduler are coupled, this introduces
friction: users need to know the underlying deployment strategy (e.g.
which Airflow URL to visit to see their DAGs, how multiple teams
collaborate on the same DAG, pipeline operations, etc.).

In one of the forum questions, Max mentioned renaming the scheduler to
"supervisor", since the scheduler does more than just scheduling.
It would be super cool if we could make multiple supervisors share the same
Airflow metadata storage and the Airflow UI (maybe by introducing a unique
config param `supervisor.id` for each instance).

This approach would let us scale the Airflow scheduler horizontally while
keeping things simple from the user's perspective.
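
Purely as an illustration of the idea (nothing like this exists in Airflow
today), the proposal might look like a config fragment along these lines:

```ini
# Hypothetical airflow.cfg fragment. The [supervisor] section and its
# `id` setting are the proposal above, not an existing Airflow feature.
[core]
# all supervisors point at the same shared metadata DB
sql_alchemy_conn = postgresql://airflow@shared-db/airflow

[supervisor]
# unique per instance, so each supervisor only handles its own DAGs
id = team_a
```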


Regards,
Ananth.P,






On 7 June 2018 at 04:08, Arturo Michel  wrote:

> We have had up to 50 DAGs with multiple tasks each. Many of them run in
> parallel. We've had some issues with compute: it was meant to be a
> temporary deployment, but somehow it's now the permanent production one, and
> resources are not great.
> Organisationally it is very similar to what Gerard described: more than one
> group working with different engineering practices and standards, which is
> probably one of the sources of problems.
>
> -Original Message-
> From: Gerard Toonstra 
> Sent: Wednesday, June 6, 2018 5:02 PM
> To: dev@airflow.incubator.apache.org
> Subject: Re: Single Airflow Instance Vs Multiple Airflow Instance
>
> We are using two cluster instances. One cluster is for the engineering
> teams that are in the "tech" wing and which rigorously follow tech
> principles, the other instance is for use by business analysts and more
> ad-hoc, experimental work, who do not necessarily follow the principles. We
> have a nomad engineer helping out the ad-hoc cluster, setting it up,
> connecting it to all systems and resolving programming questions. All
> clusters are fully puppetized, so we reuse configs and the way things are
> configured, plus have a common "platform code" package that is reused
> across both clusters.
>
> G>
>
>
> On Wed, Jun 6, 2018 at 5:50 PM, James Meickle 
> wrote:
>
> > An important consideration here is that there are several settings
> > that are cluster-wide. In particular, cluster-wide concurrency
> > settings could result in Team B's DAG refusing to schedule based on an
> error in Team A's DAG.
> >
> > Do your teams follow similar practices in how eagerly they ship code,
> > or have similar SLAs for resolving issues? If so, you are probably
> > fine using co-tenancy. If not, you should probably talk about it first
> > to make sure the teams are okay with co-tenancy.
> >
> > On Wed, Jun 6, 2018 at 11:24 AM, gauthiermarti...@gmail.com <
> > gauthiermarti...@gmail.com> wrote:
> >
> > > Hi Everyone,
> > >
> > > We have been experimenting with Airflow for about 6 months now.
> > > We are planning to have multiple departments use it. Since we
> > > don't have any internal experience with Airflow, we are wondering whether
> > > a single instance per department is better suited than a single instance
> > > with multi-tenancy. We are aware of the upcoming release of Airflow
> > > 1.10 and the changes to RBAC, which will make it better suited for
> > > multi-tenancy.
> > >
> > > Any advice or tips would be helpful to us.
> > >
> >
>
>


Re: Disable Processing of DAG file

2018-05-28 Thread Ananth Durai
It is an interesting question. On a slightly related note (correct me if
I'm wrong), AFAIK we require restarting the Airflow scheduler in order to pick
up any new DAG file changes. In that case, should the scheduler do the DAG
file processing every time before scheduling the tasks?

Regards,
Ananth.P,






On 28 May 2018 at 21:46, ramandu...@gmail.com  wrote:

> Hi All,
>
> We have a use case where there will be hundreds of DAG files with schedule
> set to "@once". Currently it seems that the scheduler processes each and
> every file and creates a DAG object.
> Is there a way or config to tell the scheduler to stop processing certain
> files?
>
> Thanks,
> Raman Gupta
>
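
Worth noting here: the scheduler skips any file under the DAGs folder that
matches a pattern in an `.airflowignore` file, one regex per line (check that
your Airflow version supports this). A sketch, with made-up file names:

```text
# .airflowignore, placed in the dags folder; each line is a regex, and
# matching files are not parsed by the scheduler.
run_once/.*
legacy_.*\.py
```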


Re: Alert Emails Templatizing

2018-05-28 Thread Ananth Durai
It is a bit tricky.
*Step 1:*
You can write an SLA-miss callback, send the email from the callback, and
empty the `slas` object so that Airflow won't send its own SLA-miss email.
https://github.com/apache/incubator-airflow/blob/master/airflow/jobs.py#L688

*Step 2:*
You can reuse Airflow's `send_email` method here:
https://github.com/apache/incubator-airflow/blob/master/airflow/jobs.py#L61

*Step 3:*
If you are sending the SLA miss from your callback, you also need to mutate
the `sla_miss` table, just like:
https://github.com/apache/incubator-airflow/blob/master/airflow/jobs.py#L726

I hope this will get simplified in future releases.
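
A rough sketch of the three steps, with the emailer injected so the callback
stays testable. The DAG-level callback signature and the trick of emptying
`slas` in place are assumptions read off the linked jobs.py code, not a
documented API:

```python
# Sketch only: the signature (dag, task_list, blocking_task_list, slas,
# blocking_tis) mirrors what the scheduler passes in the linked code.

def make_sla_miss_callback(emailer, recipients):
    """Build an sla_miss_callback that sends a custom email (step 2)
    and suppresses Airflow's built-in SLA email (step 1)."""
    def sla_miss_callback(dag, task_list, blocking_task_list, slas, blocking_tis):
        # Step 2: send our own notification. In a real deployment,
        # `emailer` would be airflow.utils.email.send_email.
        emailer(
            to=recipients,
            subject="[airflow] SLA miss in DAG %s" % dag.dag_id,
            html_content="Tasks past their SLA:<br>%s" % task_list,
        )
        # Step 1: empty `slas` in place so the scheduler finds nothing
        # left to notify about and skips its default email.
        del slas[:]
    return sla_miss_callback
```

Wiring it up is then `dag = DAG(..., sla_miss_callback=make_sla_miss_callback(send_email, ["oncall@example.com"]))`, with the address being illustrative.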

Regards,
Ananth.P,






On 28 May 2018 at 05:48, vardangupta...@gmail.com 
wrote:

> Hi team,
>
> We had a use case where we wanted to serve a different email body for
> different use cases at the time of failure & up_for_retry. Currently the
> body seems to be hard-coded in models.py. Is there any plan to make it
> templatized in an upcoming release, or would it be a good idea for us to
> make the code change and contribute? Please suggest the recommended way of
> implementing this feature.
>
>
> Regards,
> Vardan Gupta
>


Re: How to wait for external process

2018-05-28 Thread Ananth Durai
Since you are already on AWS, the simplest thing I can think of is to write a
signal file once the job finishes and have the downstream job wait for that
signal file. In other words, it is the same pattern Hadoop jobs use: writing a
`_SUCCESS` file that downstream jobs depend on.
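
A minimal sketch of the waiting side, assuming a boto3-style S3 client and an
illustrative bucket/key (Airflow's built-in S3KeySensor does essentially this
poll inside a sensor task):

```python
import time

def wait_for_marker(s3_client, bucket, key, poke_interval=60, timeout=3600,
                    sleep=time.sleep):
    """Poll until a _SUCCESS-style marker object exists in S3.

    `s3_client` is assumed to expose the boto3 list_objects_v2 call;
    `sleep` is injectable so the loop can be tested without waiting.
    """
    waited = 0
    while True:
        resp = s3_client.list_objects_v2(Bucket=bucket, Prefix=key)
        # list_objects_v2 omits "Contents" when nothing matches the prefix
        if any(obj["Key"] == key for obj in resp.get("Contents", [])):
            return True
        if waited >= timeout:
            raise RuntimeError(
                "marker s3://%s/%s not found within %ss" % (bucket, key, timeout))
        sleep(poke_interval)
        waited += poke_interval
```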

Regards,
Ananth.P,






On 28 May 2018 at 13:06, Stefan Seelmann  wrote:

> Thanks Christopher for the idea. That would work, we already have such a
> "listener" that polls a queue (SQS) and creates the DAG runs. However it
> would have been nice to have the full process in one DAG to have a
> better overview about running jobs and leverage the gantt chart, but I
> think this can be accomplished via custom plugins and views.
>
> On 05/28/2018 08:43 PM, Christopher Bockman wrote:
> > Haven't done this, but we'll have a similar need in the future, so have
> > investigated a little.
> >
> > What about a design pattern something like this:
> >
> > 1) When jobs are done (ready for further processing) they publish those
> > details to a queue (such as GC Pub/Sub or any other sort of queue)
> >
> > 2) A single "listener" DAG sits and periodically checks that queue.  If
> it
> > finds anything on it, it triggers (via DAG trigger) all of the DAGs which
> > are on the queue.*
> >
> > * = if your triggering volume is too high, this may cause airflow issues
> w/
> > too many going at once; this could presumably be solved then via custom
> > rate-limiting on firing these
> >
> > 3) The listener DAG resets itself (triggers itself)
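
Step 2 of this pattern might look roughly like the following, assuming a
boto3-style SQS client and a made-up message body of `{"dag_id": ...}`; the
actual triggering could then go through TriggerDagRunOperator or the CLI:

```python
import json

def drain_ready_jobs(sqs_client, queue_url):
    """Pull 'job done' messages off the queue and return the DAG ids
    that should be triggered. Client and message shape are assumptions,
    not a fixed contract."""
    resp = sqs_client.receive_message(QueueUrl=queue_url,
                                      MaxNumberOfMessages=10)
    dag_ids = []
    for msg in resp.get("Messages", []):
        dag_ids.append(json.loads(msg["Body"])["dag_id"])
        # Delete only after recording it, so a crash re-delivers the message.
        sqs_client.delete_message(QueueUrl=queue_url,
                                  ReceiptHandle=msg["ReceiptHandle"])
    return dag_ids
```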
> >
> >
> > On Mon, May 28, 2018 at 7:17 AM, Driesprong, Fokko  >
> > wrote:
> >
> >> Hi Stefan,
> >>
> >> Afaik there isn't a more efficient way of doing this. DAGs that rely
> >> on a lot of sensors experience the same issues. The only way right
> >> now that I can think of is updating the state directly in the database,
> >> but then you need to know what you are doing. I can imagine that this
> >> would be feasible using an AWS Lambda function. Hope this helps.
> >>
> >> Cheers, Fokko
> >>
> >> 2018-05-26 17:50 GMT+02:00 Stefan Seelmann :
> >>
> >>> Hello,
> >>>
> >>> I have a DAG (externally triggered) where some processing is done at an
> >>> external system (EC2 instance). The processing is started by an Airflow
> >>> task (via HTTP request). The DAG should only continue once that
> >>> processing is completed. In a first naive implementation I created a
> >>> sensor that gets the progress (via HTTP request) and only if status is
> >>> "finished" returns true and the DAG run continues. That works but...
> >>>
> >>> ... the external processing can take hours or days, and during that
> time
> >>> a worker is occupied which does nothing but HTTP GET and sleep. There
> >>> will be hundreds of DAG runs in parallel which means hundreds of
> workers
> >>> are occupied.
> >>>
> >>> I looked into other operators that do computation on external systems
> >>> (ECSOperator, AWSBatchOperator) but they also follow that pattern and
> >>> just wait/sleep.
> >>>
> >>> So I want to ask if there is a more efficient way to build such a
> >>> workflow with Airflow?
> >>>
> >>> Kind Regards,
> >>> Stefan
> >>>
> >>
> >
>
>


Re: Improving Airflow SLAs

2018-05-03 Thread Ananth Durai
Since we are talking about the SLA implementation: the current SLA-miss
implementation is part of the scheduler code, so in cases where the scheduler
maxes out its processes or is not running for some reason, we will miss all
the SLA alerts. It would be worth decoupling SLA alerting from the scheduler
path and running it as a separate process.


Regards,
Ananth.P,






On 2 May 2018 at 20:31, David Capwell  wrote:

> We use SLAs as well; they work great for some DAGs and are painful for
> others.
>
> We rely on sensors to validate that the data is ready before we run, and
> each DAG waits on sensors for different times (one DAG waits for 8 hours
> since it expects data at the start of the day but tends to get it 8 hours
> later). We also have some nested DAGs that are about 10 tasks deep.
>
> In these two cases SLA warnings come very late, since the semantics we get
> are based on DAG completion time; what we really want is what you were
> talking about: expected execution times.
>
> Also, SLAs trigger on backfills and manual reruns of tasks.
>
> I see this as a critical feature for production monitoring, so I would love
> to see this get improved.
>
> On Wed, May 2, 2018, 12:00 PM James Meickle 
> wrote:
>
> > At Quantopian we use Airflow to produce artifacts based on the previous
> > day's stock market data. These artifacts are required for us to trade on
> > today's stock market. Therefore, I've been investing time in improving
> > Airflow notifications (such as writing PagerDuty and Slack integrations).
> > My attention has turned to Airflow's SLA system, which has some drawbacks
> > for our use case:
> >
> > 1) Airflow SLAs are not skip-aware, so a task that has an SLA but is
> > skipped for this execution date will still trigger emails/callbacks. This
> > is a huge problem for us because we run almost no tasks on weekends
> (since
> > the stock market isn't open).
> >
> > 2) Defining SLAs can be awkward because they are relative to the
> execution
> > date instead of the task start time. There's no way to alert if a task
> runs
> > for "more than an hour", for any non-trivial DAG. Instead you can only
> > express "more than an hour from execution date".  The financial data we
> use
> > varies in when it arrives, and how long it takes to process (data volume
> > changes frequently); we also have tight timelines that make retries
> > difficult, so we want to alert an operator while leaving the task
> running,
> > rather than failing and then alerting.
> >
> > 3) SLA miss emails don't have a subject line containing the instance URL
> > (important for us because we run the same DAGs in both
> staging/production)
> > or the execution date they apply to. When opened, they can get hard to
> read
> > for even a moderately sized DAG because they include a flat list of task
> > instances that are unsorted (neither alpha nor topo). They are also
> lacking
> > any links back to the Airflow instance.
> >
> > 4) SLA emails are not callbacks, and can't be turned off (other than
> either
> > removing the SLA or removing the email attribute on the task instance).
> The
> > way that SLA miss callbacks are defined is not intuitive, as in contrast
> to
> > all other callbacks, they are DAG-level rather than task-level. Also, the
> > call signature is poorly defined: for instance, two of the arguments are
> > just strings produced from the other two arguments.
> >
> > I have some thoughts about ways to fix these issues:
> >
> > 1) I just consider this one a bug. If a task instance is skipped, that
> was
> > intentional, and it should not trigger any alerts.
> >
> > 2) I think that the `sla=` parameter should be split into something like
> > this:
> >
> > `expected_start`: Timedelta after execution date, representing when this
> > task must have started by.
> > `expected_finish`: Timedelta after execution date, representing when this
> > task must have finished by.
> > `expected_duration`: Timedelta after task start, representing how long it
> > is expected to run including all retries.
> >
> > This would give better operator control over SLAs, particularly for tasks
> > deeper in larger DAGs where exact ordering may be hard to predict.
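
Concretely, usage of the proposed parameters might look like this
(illustrative pseudocode only; none of these parameters exist in Airflow):

```python
# Pseudocode sketching the proposed split of `sla=`; these parameters
# are the proposal above, not an existing Airflow API.
process = BashOperator(
    task_id="process_market_data",
    expected_start=timedelta(hours=2),     # started by execution_date + 2h
    expected_finish=timedelta(hours=6),    # finished by execution_date + 6h
    expected_duration=timedelta(hours=1),  # runs at most 1h incl. retries
    bash_command="make artifacts",
    dag=dag,
)
```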
> >
> > 3) The emails should be improved to be more operator-friendly, and take
> > into account that someone may get a callback for a DAG they don't know
> very
> > well, or be paged by this notification.
> >
> > 4.1) All Airflow callbacks should support a list, rather than requiring a
> > single function. (I've written a wrapper that does this, but it would be
> > better for Airflow to just handle this itself.)
> >
> > 4.2) SLA miss callbacks should be task callbacks that receive context,
> like
> > all the other callbacks. Having a DAG figure out which tasks have missed
> > SLAs collectively is fine, but getting SLA failures in a batched callback
> > doesn't really make much sense. Per-task callbacks can be fired
> > individually within a batch of failures detected at the same time.
> >
> > 4.3) SLA 

Re: Awesome list of resources around Apache Airflow

2018-03-28 Thread Ananth Durai
This is awesome. Thanks for sharing the link.

Regards,
Ananth.P,






On 28 March 2018 at 21:30, Tao Feng  wrote:

> Thanks Max for sharing the link. This is great :)
>
> -Tao
>
> On Wed, Mar 28, 2018 at 9:21 PM, Maxime Beauchemin <
> maximebeauche...@gmail.com> wrote:
>
> > I was pleasantly surprised to stumble upon this recently:
> > https://github.com/jghoman/awesome-apache-airflow
> >
> > Please contribute anything that you think is missing from the list!
> >
> > Max
> >
>


Re: Rerunning task without cleaning DB?

2018-02-07 Thread Ananth Durai
We can't do that, unfortunately. Airflow schedules tasks based on the
current state in the DB. If you would like to preserve the history, one
option would be to add instrumentation in airflow_local_settings.py.
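
One way to keep the history without blocking the rerun is to snapshot the
relevant rows before clearing them. A sketch, assuming direct access to the
metadata DB; the audit table name is made up, and the `?` placeholder is
sqlite-style (the real metadata DB is usually MySQL/Postgres):

```python
def snapshot_task_instances(conn, dag_id, archive="task_instance_audit"):
    """Copy a DAG's task_instance rows into an audit table before they
    are cleared, so reruns don't erase what actually happened."""
    # Create the audit table with the same columns as task_instance.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS %s AS "
        "SELECT * FROM task_instance WHERE 0" % archive
    )
    # Archive every row belonging to this DAG.
    conn.execute(
        "INSERT INTO %s SELECT * FROM task_instance WHERE dag_id = ?" % archive,
        (dag_id,),
    )
    conn.commit()
```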

Regards,
Ananth.P,






On 5 February 2018 at 13:09, David Capwell  wrote:

> When a production issue happens, it's common that we clear the history to
> get Airflow to run the task again. This is problematic since it throws
> away the history, making it harder to find out what really happened.
>
> Is there any way to rerun a task without deleting from the DB?
>


Re: Q1 Airflow Bay Area Meetup

2018-01-08 Thread Ananth Durai
I can give a talk about all the hacks we did to scale the Airflow
LocalExecutor and improve the data pipeline on-call experience at Slack, if
folks are interested.

Regards,
Ananth.P,






On 8 January 2018 at 15:51, George Leslie-Waksman <
geo...@cloverhealth.com.invalid> wrote:

> +1 and would love to hear about other folks running Airflow in GCP.
>
> Would strongly prefer March (but I'm just one person)
>
> On Mon, Jan 8, 2018 at 2:30 PM Feng Lu  wrote:
>
> > +1 Joy!
> >
> > Would prefer April if possible, I can talk about running Airflow in GCP
> if
> > there's sufficient interest.
> >
> > On Fri, Jan 5, 2018 at 12:59 PM, Sid Anand  wrote:
> >
> > > Sounds great Joy!
> > >
> > > I've promoted you to *Event Organizer* on the Bay Area Apache Airflow
> > > Meetup as per instructions/guidelines on
> > > https://cwiki.apache.org/confluence/display/AIRFLOW/Meetups
> > >
> > > Go ahead and start setting up the meetup... we recommend 3 speakers
> with
> > > 1-2 of them being external to the hosting company.
> > >
> > > Once the meetup date is setup, we can add it to
> > > https://cwiki.apache.org/confluence/display/AIRFLOW/Announcements &
> > tweet
> > > it or share it over the dev list.
> > > -s
> > >
> > > On Fri, Jan 5, 2018 at 12:10 PM, Joy Gao  wrote:
> > >
> > > > Hi folks,
> > > >
> > > > I'm Joy from WePay. At the last Airflow meetup in December there was
> a
> > > > demand for hosting a future meetup to cover topics on Airflow
> > > integrations
> > > > with GCP/AWS (for example, CI/CD with GCP/AWS hooks/operators, best
> > > > practices on running Airflow in the cloud, managed Airflow, etc.)
> > > >
> > > > If there is enough interest, WePay can host the next event sometimes
> in
> > > > March/April. And if you would like to give a talk or have another
> topic
> > > > covered, feel free to suggest them as well.
> > > >
> > > > Cheers,
> > > > Joy
> > > >
> > >
> >
>


Airflow parallelism

2017-07-17 Thread Ananth Durai
Hi there,
I'm having a hard time to understand airflow parallelism. I'm running
Airflow with the following configuration

Executor: LocalExecutor
parallelism: 156
Airflow version: 1.8.1

I assume the maximum number of task instances would be 156 at any given
time, but I constantly see around 250 active task instances. I wonder why
I'm seeing this behavior.
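
For reference, `parallelism` interacts with a few other settings, and
depending on the version the UI count may include queued or up_for_retry
instances rather than only running ones. An airflow.cfg sketch of the
related knobs (values illustrative):

```ini
[core]
# global cap on concurrently running task instances for this instance
parallelism = 156
# per-DAG cap on concurrently running task instances
dag_concurrency = 16
# per-DAG cap on concurrently active DAG runs
max_active_runs_per_dag = 16
```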

Regards,
Ananth.P,