We are using Airflow to build up a DWH from many source systems but I have a 
couple of edge cases that I’m not quite sure how to proceed with.

Each table from a source system is supplied as a CSV. With each set of files 
delivered (daily) we receive a manifest/description file containing a list of 
files that have been sent. The number of files delivered could vary as 
supplying systems become un-available for reporting periods, connectivity or 
for some other outage reason, or when no “transactions”/events occurred. Due to 
an outage style event, A CSV can contain data that covers “transactions”/events 
for a date range not just a single day (not normally, but possible).

Additionally correction extracts could be supplied that traditionally would 
result in an update but due to some technology choices, updates become 
difficult, so ideally we are looking to follow the same patterns described here 


We have some code to re-partition each of extract by an event/transaction date 
before they can be consumed by any follow-on workflows/DAGs.

I think what I’m trying to do is trigger a follow on DAG/subDAG for each daily 
partition generated (but until we've inspected the data, the date range of 
snapshots/partitions to be generated isn't known), this seems to be analogous 
to kicking off a backfill, but I’m not sure how/if this can be done as part of 
the actual pipeline. We can put this logic into our PySpark scripts but it 
seems we would loose the nice features of the UI describing each stage of the 
ETL pipeline and MI regarding each task or to detect failure etc.

I suppose what I'm asking is whether there is an Airflow idiomatic way of doing 
this. We would like to retain the ability to perform a backfill and know that 
we have deterministic behaviour given a set of inputs (a manifest file + set of 
source files)

Hopefully this makes enough sense to people reading.

Thanks in advance.

Reply via email to