Hello everyone,
We were wondering if some of you are using a data lineage solution ?

We are aware of the experimental inlets/outlets with Apache 
Atlas<https://airflow.readthedocs.io/en/stable/lineage.html>, does someone have 
feedback to share ?
Does someone have experience with others solutions outside airflow (as all the 
workflow are not necessarily an airflow DAG)?

In my current company, we have hundreds of DAGs that run every day, many of 
which depend on data built by another DAG (DAGs are often chained through 
sensors on partitions or files in buckets, not trigger_dag).  When one of the 
DAGs fails, downstream DAGs will also start failing once their retries expire; 
similarly when we discover a bug in data, we want to mark that data as tainted 
so the challenge resides in determining impacted downstream DAGs (possibly 
having to convert periodicities) and then clear them.

Rebuilding from scratch is not ideal but we haven't found something that suits 
our needs so our idea is to implement the following :

  1.  Build a status table that describes the state of each artifact produced 
by our DAG (valid, estimated, tainted...etc), we think this can be down through 
"on_success_callback" of airflow.
  2.  Create a graph of our artefacts class model and the pipelines producing 
their (time-based) instances so that an airflow sensor can easily know what the 
status of the parent artefacts is through an API. We would use this sensor 
before running the task that creates each artefact.
  3.  From this we can handle the failure use-case: we can create an API that 
takes a DAG and an execution date as input and returns the list of tasks to 
clear and DAGRun to start downstream
  4.  From this we can handle the tainted/backfill use-case : we can build an 
on-the-fly @once DAG which will update the data status table to taint all the 
downstream data sources build from a corrupted one and then clear all dependent 
DAGs from the corrupted one (then wait to be reprocessed..).

Any experience shared will be much appreciate.

Thank you!

Germain T.

Reply via email to