This is really really awesome work. We will definitely be using this!
On Tue, Sep 27, 2016 at 5:57 PM, siddharth anand <[email protected]> wrote:

> @gwax added the LatestOnlyOperator
> https://github.com/apache/incubator-airflow/pull/1752
>
> This is a really nifty operator, so I wanted to let folks know about it. A
> lot of people run cron for a mix of workloads. Some jobs map to traditional
> ETL workloads (e.g. load hourly data summarization for 2016-10-01T00:00:00Z).
> Some are simple cron tasks -- run a database backup every night. In the
> latter case, if you miss 3 runs (e.g. your dag is paused or your start date
> is a few days/weeks/months ago), you don't want to make up for lost time
> and backfill all of those days. Essentially, running N database backups at
> once will take your database down... We'd prefer traditional cron behavior
> in these cases, not ETL behavior.
>
> *Enter the LatestOnlyOperator.*
>
> Place this operator upstream of any tasks that you want to skip unless the
> Dagrun is the latest. You can place a trigger rule downstream to "end" its
> effect. By combining a Trigger Rule with this operator, you can ensure only
> portions of your dag honor this "latest only" requirement. Or simply, have
> an entire DAG run in "latest only" mode by using the LatestOnlyOperator
> alone, i.e. not pairing it with a TriggerRule downstream.
>
> This is a useful pattern that I have been coding around for some time by
> using a ShortCircuitOperator with a python callable, where the callable
> evaluates the "latest"-ness of the dag run. I suspect we have all been
> re-inventing this wheel, which is where Airflow's Operators shine.
>
> Thanks to @gwax for implementing this and sticking with a long and often
> delayed review/merge process.
>
> -s
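
For anyone who has been hand-rolling this with a ShortCircuitOperator, here is a minimal sketch of the kind of "latest"-ness check such a callable might perform. It uses only the standard library and assumes a fixed schedule interval; the function name `is_latest_run` and its parameters are hypothetical, chosen for illustration, and this is not taken from the operator's actual source.

```python
from datetime import datetime, timedelta


def is_latest_run(execution_date, interval, now):
    """Return True if `now` falls in the window where this run is the latest.

    Assumption: a fixed schedule interval. The run for `execution_date`
    becomes due once its period ends (execution_date + interval) and stays
    the latest run until the following period ends.
    """
    window_start = execution_date + interval
    window_end = window_start + interval
    return window_start < now <= window_end


if __name__ == "__main__":
    daily = timedelta(days=1)
    now = datetime(2016, 9, 28, 6, 0)
    # Three nightly runs are pending; only the most recent should do work.
    for day in (25, 26, 27):
        run = datetime(2016, 9, day)
        print(run.date(), "run" if is_latest_run(run, daily, now) else "skip")
```

With the values above, the runs for the 25th and 26th are skipped and only the run for the 27th proceeds, which is the cron-like behavior described in the quoted message.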
