mobuchowski commented on issue #17984: URL: https://github.com/apache/airflow/issues/17984#issuecomment-926587909
> So my understanding of lineage for a data workflow is two things alone:

Just a general thought, but I think all the implementations of `LineageBackend` did something more than pure input-and-output tracking. How the dataset was produced is at least as important, and that means consuming metadata. See [Atlas (previously in Airflow)](https://github.com/crazzysun/airflow/blob/0f1d0a7e4e14886a7e74e6b0feaace28b68c64ec/airflow/lineage/backend/atlas/__init__.py) and [Datahub](https://github.com/linkedin/datahub/blob/214215759011fc983d4bfda16dab5630a02bfa14/metadata-ingestion/src/datahub_provider/_lineage_core.py#L40). I don't think any of them are interested in failed runs, though.

> In my opinion you are building more than just lineage tracking -- but something larger, as […], so it's my opinion that these events do not belong in the lineage backend interface.

Yes, we're interested in broad metadata around data. Ultimately, you could use OpenLineage events in a data discovery tool, presenting the schemas of your datasets, or build alerting around the data quality facets of those events.

> I'm leaning towards the ability to add/configure global task hook points for this sort of thing, rather than forcing something into the lineage api that only OpenLineage wants.

Sure. I'm willing to contribute a solution that fits Airflow best.

> Where should this run? On the scheduler, or the runner?

My best guess is the scheduler; ideally it would be independent of the particular executor. My question is: could we get the same kind of `context` there? For example, I'd want to look at a `PostgresOperator` instance and read its `sql` property (see the first sketch below). I'd also want a notification for situation 2): we scheduled the job, but it did not run.

How is this handled with retries? I don't think I need information about individual retries right now, but I would want information about reruns.

Overall, I think the key feature here is making this as simple as possible for end users, which means no changes to user DAGs. Ideally we'd load the class with a mechanism similar to the one we already have for `LineageBackend` (see the config sketch below).

For inspiration, I'd look at what Spark does with [`SparkListener`](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/SparkListener.scala): a general API that you can implement to receive various events during a Spark job's run (a rough Airflow analogue is sketched at the end).
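To make the `context` question concrete, here is a minimal sketch of the access the worker-side `LineageBackend` gives today: the operator instance plus the task context, from which you can read things like the rendered SQL. The class and the surrounding logic are placeholders, and the `send_lineage` signature is as I recall it from Airflow 2.x, so details may differ between versions; the open question is whether a scheduler-side hook could offer the same.

```python
from airflow.lineage.backend import LineageBackend


class ExampleMetadataBackend(LineageBackend):
    """Illustrative only -- not the actual OpenLineage backend."""

    def send_lineage(self, operator=None, inlets=None, outlets=None, context=None):
        # The operator instance carries operator-specific metadata,
        # e.g. the (templated) SQL on a PostgresOperator.
        sql = getattr(operator, "sql", None)
        # The task context gives run-level information (task instance, dates, ...).
        task_instance = context.get("task_instance") if context else None
        # An event built from this metadata would be emitted to the backend here.
```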
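And this is the loading mechanism I mean: today the `LineageBackend` class is picked up from a dotted path in Airflow configuration, with no changes to user DAGs. Something equally declarative would be ideal for a task-event hook. The class path below is a placeholder and the exact option name may vary by Airflow version.

```ini
# airflow.cfg (illustrative; the class path is a placeholder)
[lineage]
backend = my_plugin.lineage.ExampleMetadataBackend
```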
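Finally, to make the `SparkListener` analogy concrete, here is roughly the shape of the interface I have in mind on the Airflow side. Nothing like this exists in Airflow today; every name below is hypothetical and only mirrors the listener pattern: one interface, several optional callbacks, loaded from configuration.

```python
class TaskEventListener:
    """Hypothetical sketch -- not an existing Airflow API."""

    def on_task_scheduled(self, task_instance):
        """The scheduler queued the task."""

    def on_task_started(self, task_instance, context):
        """The task actually started running; together with on_task_scheduled
        this lets a backend detect 'scheduled but never ran'."""

    def on_task_finished(self, task_instance, context, state):
        """The task reached a terminal state: success, failed, and so on."""
```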
