mobuchowski commented on issue #17984:
URL: https://github.com/apache/airflow/issues/17984#issuecomment-926587909


   >So my understanding of lineage for a data workflow is two things alone:
   
   Just a general thought, but I think all the implementations of 
LineageBackend did more than pure input/output tracking. How the dataset was 
produced is at least as important, and that means consuming metadata.
   
   [Atlas (previously in 
Airflow)](https://github.com/crazzysun/airflow/blob/0f1d0a7e4e14886a7e74e6b0feaace28b68c64ec/airflow/lineage/backend/atlas/__init__.py)
   
[Datahub](https://github.com/linkedin/datahub/blob/214215759011fc983d4bfda16dab5630a02bfa14/metadata-ingestion/src/datahub_provider/_lineage_core.py#L40)
   
   I don't think any of them are interested in failed runs though. 
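
   To make that concrete, all of these backends plug into the same interface. 
Here's a minimal sketch of such a backend - the `send_lineage` signature comes 
from `airflow.lineage.backend.LineageBackend`, while the payload and transport 
are purely illustrative:

   ```python
   from airflow.lineage.backend import LineageBackend


   class MyMetadataBackend(LineageBackend):
       """Illustrative backend: forwards inlets/outlets plus operator metadata."""

       def send_lineage(self, operator, inlets=None, outlets=None, context=None):
           # Besides pure inputs/outputs, we also want "how it was produced":
           # operator class, dag/task ids, and operator-specific attributes.
           payload = {
               "dag_id": operator.dag_id,
               "task_id": operator.task_id,
               "operator": type(operator).__name__,
               "inlets": [str(i) for i in (inlets or [])],
               "outlets": [str(o) for o in (outlets or [])],
           }
           # Hypothetical transport - replace with whatever the metadata system expects.
           print(payload)
   ```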
   
   >In my opinion you are building more than just lineage tracking -- but 
something larger, so it's my opinion that these events do not belong in the 
lineage backend interface.
   
   Yes, we're interested in broad metadata around data. Ultimately, you could 
use OpenLineage events in a data discovery tool, presenting the schemas of your 
datasets, or build alerting around the data quality facets of those events.
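
   For illustration, this is roughly the shape of an OpenLineage run event 
carrying a schema facet and a data quality facet (field and facet names 
simplified here - the OpenLineage spec is the authority on the exact layout):

   ```python
   # Simplified shape of an OpenLineage run event (abridged from the spec).
   event = {
       "eventType": "COMPLETE",
       "eventTime": "2021-09-24T10:00:00Z",
       "job": {"namespace": "my-airflow", "name": "my_dag.my_task"},
       "run": {"runId": "some-uuid"},
       "outputs": [
           {
               "namespace": "postgres://db",
               "name": "public.orders",
               "facets": {
                   # A discovery tool can render the dataset schema from this facet...
                   "schema": {"fields": [{"name": "id", "type": "INTEGER"}]},
                   # ...and an alerting system can watch data quality metrics.
                   "dataQualityMetrics": {"rowCount": 1234},
               },
           }
       ],
   }
   ```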
   
   >I'm leaning towards the ability to add/configure global task hook points 
for this sort of thing, rather than forcing something in to the lineage api 
that only OpenLineage wants.
   
   Sure. I'm willing to contribute a solution that fits Airflow best. 
   
   >Where should this run? On the scheduler, or the runner?
   
   My best guess is the scheduler. Ideally it would be independent of the 
particular executor. My question is: could we get the same kind of `context` 
there? For example, I'd want to look at a `PostgresOperator` instance and read 
its `sql` property.
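
   Concretely, whatever the hook point ends up being, the extraction I have in 
mind is along these lines - `extract_metadata` is a hypothetical callback, 
while `sql` is a real attribute of `PostgresOperator`:

   ```python
   def extract_metadata(task_instance):
       """Hypothetical callback invoked by a global task hook point."""
       # The open question: does the hook get the actual operator instance,
       # or only a serialized stub without operator-specific attributes?
       operator = task_instance.task
       metadata = {
           "dag_id": task_instance.dag_id,
           "task_id": task_instance.task_id,
           "operator": type(operator).__name__,
       }
       # Operator-specific attributes, e.g. the SQL that produced the dataset.
       if hasattr(operator, "sql"):
           metadata["sql"] = operator.sql
       return metadata
   ```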
   
   I'd want to get a notification for situation 2) - we scheduled the job, but 
it did not run.
   How is this handled with retries? I don't think I need information about 
each retry right now, but I would want information about reruns.
   
   Overall, I think the key feature here would be making it as simple as 
possible for end users. That means no changes to user DAGs. Ideally we'd use a 
similar mechanism for loading the class to the one we have with LineageBackend.
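
   For reference, this is roughly how the lineage backend is resolved today - 
a dotted class path in `airflow.cfg` - and the same pattern could load a 
listener class (the snippet below is a sketch, not the exact Airflow code):

   ```python
   from airflow.configuration import conf
   from airflow.utils.module_loading import import_string

   # Roughly how [lineage] backend is resolved today:
   #   [lineage]
   #   backend = my_package.MyMetadataBackend
   backend_path = conf.get("lineage", "backend", fallback=None)
   backend = import_string(backend_path)() if backend_path else None
   ```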
   
   For inspiration, I'd look at what Spark does with 
[`SparkListener`](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/SparkListener.scala).
 It's a general API that you can implement to receive various events during a 
Spark job run.
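
   Translated to Airflow terms, I'd imagine something like the following - the 
interface and method names here are entirely hypothetical, just to show the 
SparkListener-style shape:

   ```python
   class AirflowJobListener:
       """Hypothetical SparkListener-style interface: override only what you need."""

       def on_task_instance_scheduled(self, task_instance):
           ...

       def on_task_instance_success(self, task_instance, context):
           ...

       def on_task_instance_failed(self, task_instance, context):
           ...

       def on_dag_run_finished(self, dag_run):
           ...


   class OpenLineageListener(AirflowJobListener):
       def on_task_instance_success(self, task_instance, context):
           # Build and emit an OpenLineage event from the operator + context here.
           pass
   ```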
   

