Re: Data lineage feedback

Jason Rich Thu, 06 Jun 2019 06:45:33 -0700

Great questions. This makes two of us thinking about data lineage inside and/or 
outside of airflow.


Thank you for the questions Germain. 

Cheers,
Jason

> On Jun 6, 2019, at 9:19 AM, Germain TANGUY 
> <[email protected]> wrote:
> 
> Hello everyone,
> We were wondering if some of you are using a data lineage solution ?
> 
> We are aware of the experimental inlets/outlets with Apache 
> Atlas<https://airflow.readthedocs.io/en/stable/lineage.html>, does someone 
> have feedback to share ?
> Does someone have experience with others solutions outside airflow (as all 
> the workflow are not necessarily an airflow DAG)?
> 
> In my current company, we have hundreds of DAGs that run every day, many of 
> which depend on data built by another DAG (DAGs are often chained through 
> sensors on partitions or files in buckets, not trigger_dag).  When one of the 
> DAGs fails, downstream DAGs will also start failing once their retries 
> expire; similarly when we discover a bug in data, we want to mark that data 
> as tainted so the challenge resides in determining impacted downstream DAGs 
> (possibly having to convert periodicities) and then clear them.
> 
> Rebuilding from scratch is not ideal but we haven't found something that 
> suits our needs so our idea is to implement the following :
> 
>  1.  Build a status table that describes the state of each artifact produced 
> by our DAG (valid, estimated, tainted...etc), we think this can be down 
> through "on_success_callback" of airflow.
>  2.  Create a graph of our artefacts class model and the pipelines producing 
> their (time-based) instances so that an airflow sensor can easily know what 
> the status of the parent artefacts is through an API. We would use this 
> sensor before running the task that creates each artefact.
>  3.  From this we can handle the failure use-case: we can create an API that 
> takes a DAG and an execution date as input and returns the list of tasks to 
> clear and DAGRun to start downstream
>  4.  From this we can handle the tainted/backfill use-case : we can build an 
> on-the-fly @once DAG which will update the data status table to taint all the 
> downstream data sources build from a corrupted one and then clear all 
> dependent DAGs from the corrupted one (then wait to be reprocessed..).
> 
> Any experience shared will be much appreciate.
> 
> Thank you!
> 
> Germain T.

Re: Data lineage feedback

Reply via email to