I'm hitting a bit of an annoying problem and wondering about the best
course of action.
We have several dags:
- a daily ETL job
- several reporting jobs (daily, weekly or monthly) which use the data
from previous ETL jobs
I wish to have a dependency such that the reporting jobs depend upon the
last ETL job that the report uses. We're happy to set depends_on_past
in the ETL job.
Daily jobs are easy - ExternalTaskSensor, job done.
Weekly jobs are a little trickier - we need to work out the
execution_delta - normally +6 for us (we deliberately run a day late to
prioritise other jobs).
Monthly jobs.... this is where I'm struggling - how to work out the
execution_delta. I guess the ideal would be an upgrade from timedelta
to dateutil.relativedelta? tomorrow_ds and ds_add don't help either.
I must admit, ds being the time that's just gone has caused me no end of
brain befudledness, especially when trying to get the initial job right
(so much so that I wrote this up in our DAG README, posting here for
others):
When adding a new job, it's critical to ensure that you've set the
schedule correctly:
- frequency (monthly, weekly, daily)
- schedule_interval ("0 0 2 * *", "0 0 * * 0", "0 0 * * *")
- start_date (choose a day that matches schedule_interval at least one
interval ago)
-- e.g if today is Thursday 2016-06-09, go back in time to when the
schedule will trigger,
then work out what "ds" (execution date) would be (remembering
that's the lapsed date)
--- for a monthly job, last trigger=2016-06-02, ds=2016-05-02
--- for a weekly job, last trigger=2016-06-05, ds=2016-05-29
--- for a daily job, last trigger=2016-06-09, ds=2016-06-08