I'm trying to design a set of DAGs that do a one-time create-and-backfill of a set of tables in BigQuery and then perform periodic loads into those tables. I can't quite get it to work the way I want, and I'm wondering if other people have solved similar problems.
The parameters are as follows:

- I have a list of customers: cust1, cust2, etc.
- I have a list of sources: source1, source2, etc.
- For each pairing of customer and source, I want a task that runs hourly to load new data into a unique table for that pair (loads are done via a single BigQuery query).

So far that's easy to solve: I have a single DAG and just loop through the two lists, creating a task for each pairing.

However, I also want to create the tables themselves via Airflow and run a one-time backfill for each table; backfilling hourly is proving to be a very inefficient process. Finally, I also want to be able to add new customers and/or new sources to the lists and have everything still work. I know I could achieve this with one or more DAGs per customer/source pair, but I'd hoped to avoid that explosion in the number of DAGs.

The closest I can get is two DAGs. The first has a schedule of '@once' and, for each customer/source pair, a CreateTable task with a downstream BackfillTable task. The second DAG runs '@hourly' and just has a LoadTable task for each customer/source pair. This works fine for the initial lists, but once the first DAG has run, its DagRun is marked as 'success'. If I then add a new customer or source, the new tasks get added to that first DAG, but the DagRun is never checked again. If I manually switch the DagRun back to 'running', it does pick up the new tasks.

Is there some other setup I'm missing that gets me what I want?

Thanks,
Chris
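P.S. For reference, here's a rough sketch of the two-DAG setup I'm describing. DummyOperator stands in for the real BigQuery create/backfill/load operators, and the DAG names, lists, and start date are just placeholders:

```python
# Sketch of the two-DAG layout: an '@once' DAG that creates and
# backfills each table, and an '@hourly' DAG that loads new data.
# DummyOperator is a stand-in for the actual BigQuery operators.
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

CUSTOMERS = ['cust1', 'cust2']
SOURCES = ['source1', 'source2']

default_args = {'start_date': datetime(2017, 1, 1)}

# One-time DAG: create each table, then backfill it.
init_dag = DAG('init_tables', schedule_interval='@once',
               default_args=default_args)
# Hourly DAG: incremental load into each existing table.
load_dag = DAG('hourly_loads', schedule_interval='@hourly',
               default_args=default_args)

for cust in CUSTOMERS:
    for source in SOURCES:
        pair = '{}_{}'.format(cust, source)
        create = DummyOperator(task_id='create_' + pair, dag=init_dag)
        backfill = DummyOperator(task_id='backfill_' + pair, dag=init_dag)
        create.set_downstream(backfill)
        DummyOperator(task_id='load_' + pair, dag=load_dag)
```

Adding a new customer or source just extends the loop, which is exactly where the problem shows up: the new create/backfill tasks appear in init_tables, but its only DagRun has already completed.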
