If you want to run (daily, rolling weekly, rolling monthly) backups on a daily basis, and they're mostly the same but have some additional dependencies, you can write a DAG factory method, which you call three times. Certain nodes only get added to the longer-than-daily backups.
On Wed, Aug 8, 2018 at 2:03 PM Gabriel Silk <gs...@dropbox.com.invalid> wrote: > Thanks Andy and Taylor for the suggestions -- > > I see how that would work for the case where you want a weekly rollup that > runs on a weekly cadence. > > But what about a rolling weekly or monthly rollup that runs each day? > > On Wed, Aug 8, 2018 at 11:00 AM, Andy Cooper <andy.coo...@astronomer.io> > wrote: > > > To expand on Taylor's idea > > > > I recently wrote a ScheduleBlackoutSensor that would allow you to > prevent a > > task from running if it meets the criteria provided. It accepts an array > of > > args for any number of the criteria so you could leverage this sensor to > > provide "blackout" runs for a range of days of the week. > > > > https://github.com/apache/incubator-airflow/pull/3702/files > > > > For example, > > > > task = ScheduleBlackoutSensor(day_of_week=[0,1,2,3,4,5], dag=dag) > > > > Would prevent a task from running Monday - Saturday, allowing it to run > on > > Sunday. > > > > You could leverage this Sensor as you would any other sensor or you could > > invert the logic so that you would only need to specify > > > > task = ScheduleBlackoutSensor(day_of_week=6, dag=dag) > > > > To "whitelist" a task to run on Sundays. > > > > > > Let me know if you have any questions > > > > On Wed, Aug 8, 2018 at 1:47 PM Taylor Edmiston <tedmis...@gmail.com> > > wrote: > > > > > Gabriel - > > > > > > One approach I've seen for a similar use case is to have multiple > related > > > rollups in one DAG that runs daily, then have the non-daily tasks skip > > most > > > of the time (e.g., weekly only actually executes on Sundays and is > > > parameterized to look at the last 7 days). > > > > > > You could implement that not running part a few ways, but one idea is a > > > sensor in front of the weekly rollup task. Imagine a SundaySensor like > > > return > > > execution_date.weekday() == 6. One thing to keep in mind here is > > > dependence on the DAG's cron schedule being more granular than the > tasks. > > > > > > I think this could generalize into a DayOfWeekSensor / DayOfMonthSensor > > > that would be nice to have. > > > > > > Of course this does mean some scheduler inefficiency on the skip days, > > but > > > as long as those skips are fast and the overall number of tasks is > > small, I > > > can accept that. > > > > > > *Taylor Edmiston* > > > Blog <https://blog.tedmiston.com/> | CV > > > <https://stackoverflow.com/cv/taylor> | LinkedIn > > > <https://www.linkedin.com/in/tedmiston/> | AngelList > > > <https://angel.co/taylor> | Stack Overflow > > > <https://stackoverflow.com/users/149428/taylor-edmiston> > > > > > > > > > On Wed, Aug 8, 2018 at 1:11 PM, Gabriel Silk <gs...@dropbox.com.invalid > > > > > wrote: > > > > > > > Hello Airflow community, > > > > > > > > I have a basic question about how best to model a common data > pipeline > > > > pattern here at Dropbox. > > > > > > > > At Dropbox, all of our logs are ingested and written into Hive in > > hourly > > > > and/or daily rollups. On top of this data we build many weekly and > > > monthly > > > > rollups, which typically run on a daily cadence and compute results > > over > > > a > > > > rolling window. > > > > > > > > If we have a metric X, it seems natural to put the daily, weekly, and > > > > monthly rollups for metric X all in the same DAG. > > > > > > > > However, the different rollups have different dependency structures. > > The > > > > daily job only depends on a single day partition, whereas the weekly > > job > > > > depends on 7, the monthly on 28. > > > > > > > > In Airflow, it seems the two paradigms for modeling dependencies are: > > > > 1) Depend on a *single run of a task* within the same DAG > > > > 2) Depend on *multiple runs of task* by using an ExternalTaskSensor > > > > > > > > I'm not sure how I could possibly model this scenario using approach > > #1, > > > > and I'm not sure approach #2 is the most elegant or performant way to > > > model > > > > this scenario. > > > > > > > > Any thoughts or suggestions? > > > > > > > > > >