Great questions. This makes two of us thinking about data lineage inside and/or outside of airflow.
Thank you for the questions Germain. Cheers, Jason > On Jun 6, 2019, at 9:19 AM, Germain TANGUY > <[email protected]> wrote: > > Hello everyone, > We were wondering if some of you are using a data lineage solution ? > > We are aware of the experimental inlets/outlets with Apache > Atlas<https://airflow.readthedocs.io/en/stable/lineage.html>, does someone > have feedback to share ? > Does someone have experience with others solutions outside airflow (as all > the workflow are not necessarily an airflow DAG)? > > In my current company, we have hundreds of DAGs that run every day, many of > which depend on data built by another DAG (DAGs are often chained through > sensors on partitions or files in buckets, not trigger_dag). When one of the > DAGs fails, downstream DAGs will also start failing once their retries > expire; similarly when we discover a bug in data, we want to mark that data > as tainted so the challenge resides in determining impacted downstream DAGs > (possibly having to convert periodicities) and then clear them. > > Rebuilding from scratch is not ideal but we haven't found something that > suits our needs so our idea is to implement the following : > > 1. Build a status table that describes the state of each artifact produced > by our DAG (valid, estimated, tainted...etc), we think this can be down > through "on_success_callback" of airflow. > 2. Create a graph of our artefacts class model and the pipelines producing > their (time-based) instances so that an airflow sensor can easily know what > the status of the parent artefacts is through an API. We would use this > sensor before running the task that creates each artefact. > 3. From this we can handle the failure use-case: we can create an API that > takes a DAG and an execution date as input and returns the list of tasks to > clear and DAGRun to start downstream > 4. From this we can handle the tainted/backfill use-case : we can build an > on-the-fly @once DAG which will update the data status table to taint all the > downstream data sources build from a corrupted one and then clear all > dependent DAGs from the corrupted one (then wait to be reprocessed..). > > Any experience shared will be much appreciate. > > Thank you! > > Germain T.
