Hi. We are using Airflow to build up a DWH from many source systems, but I have a couple of edge cases that I'm not quite sure how to proceed with.
Each table from a source system is supplied as a CSV. With each daily set of files we also receive a manifest/description file listing the files that have been sent. The number of files delivered can vary: supplying systems become unavailable for reporting periods, there are connectivity or other outages, or simply no "transactions"/events occurred. Because of an outage-style event, a CSV can also contain data covering a range of event dates rather than a single day (not normally, but it is possible). Additionally, correction extracts can be supplied that would traditionally result in an update, but due to some technology choices updates are difficult for us, so ideally we want to follow the patterns described here: https://medium.com/@maximebeauchemin/functional-data-engineering-a-modern-paradigm-for-batch-data-processing-2327ec32c42a

We have some code to re-partition each extract by event/transaction date before it can be consumed by any follow-on workflows/DAGs. What I think I'm trying to do is trigger a follow-on DAG/subDAG for each daily partition generated, but the date range of the snapshots/partitions to be generated isn't known until we've inspected the data. This seems analogous to kicking off a backfill, but I'm not sure how, or whether, that can be done as part of the actual pipeline. We could put this logic into our PySpark scripts, but then it seems we would lose the nice features of the UI describing each stage of the ETL pipeline, the MI about each task, failure detection, and so on. There are rough sketches of both steps at the end of this post.

I suppose what I'm asking is whether there is an Airflow-idiomatic way of doing this. We would like to retain the ability to perform a backfill and to know that we have deterministic behaviour given a set of inputs (a manifest file plus a set of source files).
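For context, the re-partition step looks roughly like this (the column names and paths are made up for the example, but the shape is accurate):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("repartition_extract").getOrCreate()

# Read one of the CSV extracts named in the manifest.
raw = (
    spark.read
    .option("header", "true")
    .csv("/landing/source_a/20190101/transactions.csv")
)

# Derive an event-date column and write one partition per event/transaction
# date, so a follow-on DAG can consume exactly one daily snapshot.
(
    raw
    .withColumn("event_date", F.to_date("transaction_ts"))
    .write
    .partitionBy("event_date")
    .mode("overwrite")
    .parquet("/staging/source_a/transactions")
)
```

And this is only a sketch of the fan-out idea we've been experimenting with, not something we're happy with: a task at the end of the ingest DAG that inspects which event-date partitions were produced and triggers one run of the follow-on DAG per date. The DAG ids and hard-coded dates are placeholders, and I'm aware the trigger_dag import path and conf handling differ between Airflow versions:

```python
from datetime import datetime

from airflow import DAG
from airflow.api.common.experimental.trigger_dag import trigger_dag
from airflow.operators.python_operator import PythonOperator


def trigger_per_partition(**context):
    # In reality this would list the event_date=YYYY-MM-DD directories the
    # PySpark job wrote; hard-coded here to keep the sketch short.
    event_dates = ["2019-01-01", "2019-01-02"]
    for event_date in event_dates:
        trigger_dag(
            dag_id="load_daily_partition",  # the follow-on DAG
            run_id="manifest_{}_{}".format(context["ds"], event_date),
            conf={"event_date": event_date},  # may need json.dumps() on some versions
            execution_date=datetime.strptime(event_date, "%Y-%m-%d"),
        )


with DAG(
    dag_id="ingest_source_extracts",
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    fan_out = PythonOperator(
        task_id="trigger_per_partition",
        python_callable=trigger_per_partition,
        provide_context=True,
    )
```

This is what I mean by "analogous to a backfill": each triggered run of the follow-on DAG gets the event date as its execution date.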
Hopefully this makes enough sense to people reading. Thanks in advance.