mobuchowski commented on issue #17984:
URL: https://github.com/apache/airflow/issues/17984#issuecomment-926587909


   >So my understanding of lineage for a data workflow is two things alone:
   
   Just a general thought, but I think all the implementations of 
LineageBackend did more than pure input/output tracking. How the dataset was 
produced is at least as important, and that means consuming metadata.
   
   [Atlas (previously in 
Airflow)](https://github.com/crazzysun/airflow/blob/0f1d0a7e4e14886a7e74e6b0feaace28b68c64ec/airflow/lineage/backend/atlas/__init__.py)
   
[Datahub](https://github.com/linkedin/datahub/blob/214215759011fc983d4bfda16dab5630a02bfa14/metadata-ingestion/src/datahub_provider/_lineage_core.py#L40)
   
   I don't think any of them are interested in failed runs though. 
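
   To make that concrete, all of these backends plug into the same interface. 
Here's a minimal sketch of such a backend - the `send_lineage` signature comes 
from `airflow.lineage.backend.LineageBackend`, while the payload and transport 
are purely illustrative:

   ```python
   from airflow.lineage.backend import LineageBackend


   class MyMetadataBackend(LineageBackend):
       """Illustrative backend: forwards inlets/outlets plus operator metadata."""

       def send_lineage(self, operator, inlets=None, outlets=None, context=None):
           # Besides pure inputs/outputs, we also want "how it was produced":
           # operator class, dag/task ids, and operator-specific attributes.
           payload = {
               "dag_id": operator.dag_id,
               "task_id": operator.task_id,
               "operator": type(operator).__name__,
               "inlets": [str(i) for i in (inlets or [])],
               "outlets": [str(o) for o in (outlets or [])],
           }
           # Hypothetical transport - replace with whatever the metadata system expects.
           print(payload)
   ```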
   
   >In my opinion you are building more than just lineage tracking -- but 
something larger, so it's my opinion that these events do not belong in the 
lineage backend interface.
   
   Yes, we're interested in broad metadata around data. Ultimately, you could 
use OpenLineage events in a data discovery tool, presenting the schemas of your 
datasets, or build alerting around the data quality facets of those events.
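
   For illustration, this is roughly the shape of an OpenLineage run event 
carrying a schema facet and a data quality facet (field and facet names 
simplified here - the OpenLineage spec is the authority on the exact layout):

   ```python
   # Simplified shape of an OpenLineage run event (abridged from the spec).
   event = {
       "eventType": "COMPLETE",
       "eventTime": "2021-09-24T10:00:00Z",
       "job": {"namespace": "my-airflow", "name": "my_dag.my_task"},
       "run": {"runId": "some-uuid"},
       "outputs": [
           {
               "namespace": "postgres://db",
               "name": "public.orders",
               "facets": {
                   # A discovery tool can render the dataset schema from this facet...
                   "schema": {"fields": [{"name": "id", "type": "INTEGER"}]},
                   # ...and an alerting system can watch data quality metrics.
                   "dataQualityMetrics": {"rowCount": 1234},
               },
           }
       ],
   }
   ```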
   
   >I'm leaning towards the ability to add/configure global task hook points 
for this sort of thing, rather than forcing something in to the lineage api 
that only OpenLineage wants.
   
   Sure. I'm willing to contribute a solution that fits Airflow best. 
   
   >Where should this run? On the scheduler, or the runner?
   
   My best guess is the scheduler. Ideally it would be independent of the 
particular executor. My question is: could we get the same kind of `context` 
there? For example, I'd want to look at a `PostgresOperator` instance and read 
its `sql` property.
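
   Concretely, whatever the hook point ends up being, the extraction I have in 
mind is along these lines - `extract_metadata` is a hypothetical callback, 
while `sql` is a real attribute of `PostgresOperator`:

   ```python
   def extract_metadata(task_instance):
       """Hypothetical callback invoked by a global task hook point."""
       # The open question: does the hook get the actual operator instance,
       # or only a serialized stub without operator-specific attributes?
       operator = task_instance.task
       metadata = {
           "dag_id": task_instance.dag_id,
           "task_id": task_instance.task_id,
           "operator": type(operator).__name__,
       }
       # Operator-specific attributes, e.g. the SQL that produced the dataset.
       if hasattr(operator, "sql"):
           metadata["sql"] = operator.sql
       return metadata
   ```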
   
   I'd want to get a notification for situation 2) - we scheduled the job, but 
it did not run.
   How is this handled with retries? I don't think I need information about 
each retry right now, but I would want information about reruns.
   
   Overall, I think the key feature here would be making it as simple as 
possible for end users. That means no changes to user DAGs. Ideally we'd use a 
similar mechanism for loading the class to the one we have with LineageBackend.
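
   For reference, this is roughly how the lineage backend is resolved today - 
a dotted class path in `airflow.cfg` - and the same pattern could load a 
listener class (the snippet below is a sketch, not the exact Airflow code):

   ```python
   from airflow.configuration import conf
   from airflow.utils.module_loading import import_string

   # Roughly how [lineage] backend is resolved today:
   #   [lineage]
   #   backend = my_package.MyMetadataBackend
   backend_path = conf.get("lineage", "backend", fallback=None)
   backend = import_string(backend_path)() if backend_path else None
   ```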
   
   For inspiration, I'd look at what Spark does with 
[`SparkListener`](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/SparkListener.scala).
 It's a general API that you can implement to receive various events during a 
Spark job run.
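
   Translated to Airflow terms, I'd imagine something like the following - the 
interface and method names here are entirely hypothetical, just to show the 
SparkListener-style shape:

   ```python
   class AirflowJobListener:
       """Hypothetical SparkListener-style interface: override only what you need."""

       def on_task_instance_scheduled(self, task_instance):
           ...

       def on_task_instance_success(self, task_instance, context):
           ...

       def on_task_instance_failed(self, task_instance, context):
           ...

       def on_dag_run_finished(self, dag_run):
           ...


   class OpenLineageListener(AirflowJobListener):
       def on_task_instance_success(self, task_instance, context):
           # Build and emit an OpenLineage event from the operator + context here.
           pass
   ```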
   

