Hi. We are using Airflow to build up a DWH from many source systems, but I have a couple of edge cases that I'm not quite sure how to proceed with.
Each table from a source system is supplied as a CSV. With each daily set of files we also receive a manifest/description file listing the files that have been sent. The number of files delivered can vary: supplying systems become unavailable for reporting periods, there are connectivity or other outages, or simply no "transactions"/events occurred. Because of an outage-style event, a CSV can also contain data covering a range of event dates rather than a single day (not normally, but it is possible). Additionally, correction extracts can be supplied that would traditionally result in an update, but due to some technology choices updates are difficult for us, so ideally we want to follow the patterns described here: https://medium.com/@maximebeauchemin/functional-data-engineering-a-modern-paradigm-for-batch-data-processing-2327ec32c42a

We have some code to re-partition each extract by event/transaction date before it can be consumed by any follow-on workflows/DAGs. What I think I'm trying to do is trigger a follow-on DAG/subDAG for each daily partition generated, but the date range of the snapshots/partitions to be generated isn't known until we've inspected the data. This seems analogous to kicking off a backfill, but I'm not sure how, or whether, that can be done as part of the actual pipeline. We could put this logic into our PySpark scripts, but then it seems we would lose the nice features of the UI describing each stage of the ETL pipeline, the MI about each task, failure detection, and so on. There are rough sketches of both steps at the end of this post.

I suppose what I'm asking is whether there is an Airflow-idiomatic way of doing this. We would like to retain the ability to perform a backfill and to know that we have deterministic behaviour given a set of inputs (a manifest file plus a set of source files).
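For context, the re-partition step looks roughly like this (the column names and paths are made up for the example, but the shape is accurate):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("repartition_extract").getOrCreate()

# Read one of the CSV extracts named in the manifest.
raw = (
    spark.read
    .option("header", "true")
    .csv("/landing/source_a/20190101/transactions.csv")
)

# Derive an event-date column and write one partition per event/transaction
# date, so a follow-on DAG can consume exactly one daily snapshot.
(
    raw
    .withColumn("event_date", F.to_date("transaction_ts"))
    .write
    .partitionBy("event_date")
    .mode("overwrite")
    .parquet("/staging/source_a/transactions")
)
```

And this is only a sketch of the fan-out idea we've been experimenting with, not something we're happy with: a task at the end of the ingest DAG that inspects which event-date partitions were produced and triggers one run of the follow-on DAG per date. The DAG ids and hard-coded dates are placeholders, and I'm aware the trigger_dag import path and conf handling differ between Airflow versions:

```python
from datetime import datetime

from airflow import DAG
from airflow.api.common.experimental.trigger_dag import trigger_dag
from airflow.operators.python_operator import PythonOperator


def trigger_per_partition(**context):
    # In reality this would list the event_date=YYYY-MM-DD directories the
    # PySpark job wrote; hard-coded here to keep the sketch short.
    event_dates = ["2019-01-01", "2019-01-02"]
    for event_date in event_dates:
        trigger_dag(
            dag_id="load_daily_partition",  # the follow-on DAG
            run_id="manifest_{}_{}".format(context["ds"], event_date),
            conf={"event_date": event_date},  # may need json.dumps() on some versions
            execution_date=datetime.strptime(event_date, "%Y-%m-%d"),
        )


with DAG(
    dag_id="ingest_source_extracts",
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    fan_out = PythonOperator(
        task_id="trigger_per_partition",
        python_callable=trigger_per_partition,
        provide_context=True,
    )
```

This is what I mean by "analogous to a backfill": each triggered run of the follow-on DAG gets the event date as its execution date.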
Hopefully this makes enough sense to people reading. Thanks in advance.